“1e32jnd9312”, “32189321-DEF3123-9898312”, “ADEFi382819312”. Do these strings look familiar? They could be hashes, randomly generated passwords, API keys, or many other kinds of strings. You can usually spot them in logs, command lines, configuration files, and source code. Whether you are analyzing security and application logs or hunting for accidentally exposed credentials, they can make your life a lot harder, because building a search pattern for something random is a particularly hard task.
Stringlifier is our latest open source project, and it can help you tackle this often difficult task. It is an open source Python package that detects code or text resembling a randomly generated string in any plain text. It uses machine learning to distinguish between normal and random character sequences. It can also be adapted for more fine-grained classification (password, API key, hash, etc.).
The entire source code is available now in Adobe’s public GitHub repository. We also provide a “pip” (Python package installer) installation package that includes a pre-trained model.
We did our best to make Stringlifier as easy to use as possible. To get started, you can simply install the module using pip.
$ pip install stringlifier
After this, all you have to do is import the API, create a new instance, and pass any string through it:
from stringlifier.api import Stringlifier
stringlifier = Stringlifier()
s = stringlifier('/System/Library/DriverExtensions/AppleUserHIDDrivers.dext/AppleUserHIDDrivers com.apple.driverkit.AppleUserUSBHostHIDDevice0 0x10000992d')
In this simple example, the result (stored in s) should be:
'/System/Library/DriverExtensions/AppleUserHIDDrivers.dext/AppleUserHIDDrivers com.apple.driverkit.AppleUserUSBHostHIDDevice0 <RANDOM_STRING>'
Under the hood, “0x10000992d” was replaced by a token labeled “<RANDOM_STRING>”.
In some of our previous blog posts, we discussed finding anomalies in different datasets and introduced an open source tool called Tripod to help. In many cases, data points contain long strings that we have to pre-process and convert into numerical form before we can feed them into machine learning models. We have done this using several approaches, including BLEU scoring with custom clustering, TF-IDF with bag of words, and TF-IDF over byte-pair-encoding (BPE) tokens with K-Means on top. Grouping strings into robust clusters is essential for all of these approaches, but we have always hit a roadblock: random strings. Depending on how long the random string is relative to the string that contains it, it can skew the result of the clustering algorithm and disrupt how the data is grouped.
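To make the effect of string length concrete, here is a minimal sketch that uses plain token overlap (Jaccard similarity) as a simple stand-in for the TF-IDF and clustering pipelines mentioned above. The command lines, the tokenizer, and the function names are hypothetical and chosen only for illustration; this is not how Tripod or Stringlifier compute similarity.

```python
import re


def tokens(text):
    """Split a command line into a set of tokens on whitespace and '/'."""
    return {t for t in re.split(r"[\s/]+", text) if t}


def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


# Hypothetical command lines: each pair differs only in a random job ID.
long_a = "/opt/app/bin/worker --queue default --job 3f9a1c07e2b44d5a"
long_b = "/opt/app/bin/worker --queue default --job b81d6e20c9f7431e"
short_a = "worker 3f9a1c07e2b44d5a"
short_b = "worker b81d6e20c9f7431e"

long_raw = jaccard(long_a, long_b)     # 7 shared tokens out of 9
short_raw = jaccard(short_a, short_b)  # 1 shared token out of 3

# Once the random ID is replaced by a placeholder, the pair is identical.
masked = jaccard("worker <RANDOM_STRING>", "worker <RANDOM_STRING>")

print(f"long pair:   {long_raw:.2f}")   # 0.78
print(f"short pair:  {short_raw:.2f}")  # 0.33
print(f"masked pair: {masked:.2f}")     # 1.00
```

The shorter the surrounding string, the more a single random token drags the similarity down, which is why two invocations of the same command can end up in different clusters.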
For example, we are currently working to detect anomalies in datasets generated by HubbleStack, another of Adobe’s open source projects that is in daily active use here.
Let’s take the following command line as an example:
string = "/run/torcx/bin/docker --config /var/lib/mesos/slave/slaves/db2bb0dd-12b0-4167-a1cc-23ef4a4a4211-S1196/frameworks/db2bb0dd-12b0-4167-a1cc-23ef4a4a4211-0001/executors/bladerunner-sysdig.ec491d2d-b02e-11ea-899e-86449ab0c296/runs/162e0529-7244-4f9b-aaff-dd32d015514e/.docker run --privileged --userns=host -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro internaladobeurl.com/url/url:1.0.0"
This is a valid command line. However, if you take into consideration all the UUIDs present here, it becomes a total mess. Stringlifier can help us clean it up really fast:
s = stringlifier(string)
'/run/torcx/bin/docker --config /var/lib/mesos/slave/slaves/<RANDOM_STRING>-S1196/frameworks/<RANDOM_STRING>-0001/executors/bladerunner-sysdig.<RANDOM_STRING>/runs/<RANDOM_STRING>/.docker run --privileged --userns=host -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro internaladobeurl.com/url/url:1.0.0'
All of the random character sequences were replaced with <RANDOM_STRING>. This makes it easier to group similar command lines that use random hashes in their parameters but otherwise have identical behavior and scope. As a nice addition, the machine learning model recognized that “0001” and “S1196” are not part of the random strings.
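The grouping step described here can be sketched as follows. To keep the example free of dependencies, a UUID regular expression stands in for the Stringlifier model; it is an assumption made purely for illustration and only catches UUID-shaped strings, not random strings in general. The helper names and the shortened command lines are hypothetical as well.

```python
import re
from collections import defaultdict

# Stand-in for the model: matches UUID-shaped strings only (an assumption).
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def uuid_stand_in(text):
    """Replace every UUID-shaped substring with a placeholder token."""
    return UUID_RE.sub("<RANDOM_STRING>", text)


def group_by_template(lines, normalize):
    """Group lines by the template that `normalize` reduces them to."""
    groups = defaultdict(list)
    for line in lines:
        groups[normalize(line)].append(line)
    return dict(groups)


# Hypothetical, shortened command lines.
lines = [
    "/bin/docker --config /runs/162e0529-7244-4f9b-aaff-dd32d015514e/.docker run",
    "/bin/docker --config /runs/db2bb0dd-12b0-4167-a1cc-23ef4a4a4211/.docker run",
    "/usr/bin/ls -la",
]

groups = group_by_template(lines, uuid_stand_in)
for template, members in groups.items():
    print(len(members), template)
```

Because `group_by_template` accepts any callable that maps a string to a string, a Stringlifier instance could be passed in place of `uuid_stand_in`, assuming it returns a single string per input as in the examples above.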
We hope you find Stringlifier useful. The entire source code is available in Adobe’s GitHub repository. You can also find all of our other open source projects from across Adobe’s security teams in that repository. We look forward to your feedback, and contributions are always welcome.
Data Scientist/Machine Learning Engineer
Sr. Security Engineer