Using Machine Learning to More Quickly Evaluate the Threat Level of External Domains

Major Initiatives, Ongoing Research, Security Automation

Most antivirus (AV) software is designed for home or personal use and covers common scenarios. Corporate networks, however, must also defend against targeted attacks, which are designed to get around both the security policy and standard AV vendors. Relying solely on external vendors for security likely means ignoring the profiles of attackers that target your specific infrastructure. Also, the number of artifacts that need to be checked grows with the size of the internal network. This raises the issue of potential API limitations and scanning capacity when relying solely on AV providers. Thus, a complete security solution must rely on both external vendors and a threat model specific to your own company.

Our scope

Whether we are talking about unsuspecting users visiting malicious websites or being tricked into clicking malicious links in phishing emails, innocent user behaviors are a common vector for attacks. Attackers are also getting smarter, monitoring these interactions more directly to determine how their malicious attempts can be made more effective. Depending on the size of the network, an organization could see several hundred thousand unique domains accessed on a daily basis. Luckily, you don’t have to check all of them; you can take an incremental approach and check only the newly observed ones. Of course, a resource that was previously vetted could become compromised over time, but with a healthy cache you can re-add such domains to the cycle and check them periodically. This could keep increasing the number of domains you have to check in the long run, but keep in mind the dynamic behavior of browsing: some domains are accessed only once (the user never returns); others become trending and are seen frequently (you would expect their owners to take security seriously and help make sure they are not compromised); and domains that were trending at some point can fall back into obscurity and are no longer accessed. In short, you may have to scale eventually, but it will take a while to get to that point.
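
The incremental approach described above can be sketched as a small cache that remembers when each domain was last vetted. This is only an illustration under stated assumptions, not the production system: the `VettingCache` class, its method names, and the 30-day re-check interval are all hypothetical.

```python
from datetime import datetime, timedelta


class VettingCache:
    """Hypothetical cache of vetted domains with periodic re-checks."""

    def __init__(self, recheck_after=timedelta(days=30)):
        # recheck_after is an assumed interval, not a value from the article.
        self.recheck_after = recheck_after
        self.last_vetted = {}  # domain -> datetime of the last check

    def needs_check(self, domain, now):
        """True for never-seen domains and for stale cache entries."""
        seen = self.last_vetted.get(domain)
        return seen is None or now - seen >= self.recheck_after

    def select_for_check(self, observed, now):
        """Return the sorted subset of observed domains that should be scanned."""
        return sorted(d for d in set(observed) if self.needs_check(d, now))

    def mark_vetted(self, domains, now):
        """Record that the given domains were checked at time `now`."""
        for d in domains:
            self.last_vetted[d] = now


cache = VettingCache()
day0 = datetime(2020, 1, 1)

# Day 0: nothing is cached yet, so every observed domain is selected.
first = cache.select_for_check(["a.example", "b.example"], day0)
cache.mark_vetted(first, day0)

# Next day: only the domain that was never seen before is selected.
day1 = day0 + timedelta(days=1)
second = cache.select_for_check(["a.example", "c.example"], day1)
cache.mark_vetted(second, day1)

# Once the interval has elapsed, a.example re-enters the cycle.
third = cache.select_for_check(["a.example", "c.example"], day0 + timedelta(days=30))
```

Domains that are accessed only once simply age in the cache without ever being re-selected, which is why the daily workload tends to grow slowly.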

But what are newly observed domains all about? For most AV vendors and passive DNS companies, newly observed domains are web entities that were registered recently, a couple of days or maybe a couple of hours ago. Recent registration can be an indicator that a domain is worth investigating.

For the Adobe security team, NODs (newly observed domains) are web resources that were accessed from Adobe infrastructure or from Adobe-owned devices and that have never been seen before in the Adobe logs. This means we are looking for NODs from an Adobe perspective and not from a broader web perspective. Even if you focus only on newly observed domains, you could still end up with tens of thousands per day, a number that is still likely to exceed the daily query limits of most APIs. You therefore need a quick way to reduce the number of domains that are scanned.

Generating NODs

Let us take a quick look at what NODs represent. As we mentioned, we are interested in seeing which new domains are contacted daily. We start by looking through a variety of log sources, such as proxy, DNS, or EDR-generated logs.

In order to initialize the process, we take the logs for a long period of time (one month) and compile a list of all domains contacted in this period that are not part of the Cisco Umbrella “Top 1 Million” dataset. The result is our initializer.

Then, the next day, we collect all the domains that were contacted (using the same log sources) and that are neither in the Cisco Umbrella Top 1 Million nor in the data collected previously. That generates the first day of NODs. We add those results to the initializer and repeat the process the next day.

This way, we obtain a daily list of all unique newly observed domains from an Adobe perspective.
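
The procedure above amounts to a running set difference. The following is a minimal sketch, assuming the domains have already been extracted from the logs into plain lists of strings; the function names and the tiny sample inputs are made up for illustration.

```python
def build_initializer(historical_domains, top_domains):
    """Domains seen in the bootstrap period that are not in the top list."""
    return set(historical_domains) - set(top_domains)


def daily_nods(todays_domains, known_domains, top_domains):
    """Return today's NODs and the updated set of known domains."""
    nods = set(todays_domains) - set(top_domains) - known_domains
    return nods, known_domains | nods


# Tiny made-up inputs; real inputs would come from proxy, DNS, or EDR logs,
# and `top` would be loaded from the Top 1 Million dataset.
top = {"google.com", "adobe.com"}
month_of_logs = ["google.com", "blog.example", "shop.example", "blog.example"]

known = build_initializer(month_of_logs, top)

# Day 1: adobe.com is filtered by the top list, blog.example by the initializer.
day1, known = daily_nods(["adobe.com", "blog.example", "new1.example"], known, top)

# Day 2: new1.example was added to the known set on day 1, so it is not repeated.
day2, known = daily_nods(["new1.example", "new2.example"], known, top)
```

Because each day's NODs are folded back into the known set, a domain is reported at most once.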

Knowing versus guessing

There are two broad categories of methods aimed at detecting if an item (domain, binary, script, etc.) represents a potential threat or not: signature-based methods and heuristic-based methods. Signature-based methods require prior identification and labeling of a potential threat, followed by hashing or signature creation. These types of methods are useful for accurately detecting well-established threats and they are less prone to error. However, emerging threats are best addressed by heuristic methods. These look for common threat behavioral patterns or try to detect anomalies in the way software/scripts/sites work.

Heuristic techniques employ either hand-crafted rules or automatically generated rules/patterns. This, in turn, can be supported by machine learning methods. Whatever the case, heuristic methods have a higher chance of generating false positives, mainly because of the blurred line between normal and abnormal software behavior. Examples include websites that send data to third-party entities, text editors that edit registry keys and access low-level system (kernel) functions, and plugins that commonly hook or load libraries and sometimes even overwrite standard function calls. Thus, unless you can afford to rely solely on security policies that could potentially slow down the process, heuristically reported artifacts must be manually curated.

This brings us to another issue: the manual testing capacity of any company. The number of investigations that can be carried out in a specific timeframe is relatively small compared to the number of false positives generated by many heuristic methods. Even a single source of truth from an open-source (OS) or commercial vendor is likely to exceed the work capacity of the security analyst team. To help mitigate this issue, we set out to create our own detection pipeline for newly observed domains, with the goal of reducing the number of false positives and reported artifacts to a more manageable number.

The solution involves three machine learning methods that independently assess whether an artifact (in our case, a domain) is potentially malicious, based on linearly independent features extracted from various vendors and OS data. Figure 1 shows the generic architecture of our system.

[NOTE: All data shown in this article is hypothetical training data that was used to prove the model and application.]

Importantly, the training data was specifically selected to be disjoint, in order to reduce the common bias of the three classifiers and the effect of mislabeled data from our sources of truth.

Here are some useful stats from the training data (see Table 1, below). The dataset contains around 2M examples in total, of which 1.1M are benign and almost 0.9M are malicious.

Whenever possible, we tried to obtain the subclassification of malicious domains. As one can see from Table 1, we also include examples of domains related to mining in our dataset. The examples labeled as unknown (UNK) are either domains for which we could not get any subclassification in the data-sources, or domains for which different sources had contradicting labels. In a real-world application, these would need to be investigated further. 

Drill-down into the system

Figure 2 above shows the generic architecture of our detection pipeline. As already mentioned, our system relies on three classifiers that independently determine if a domain is malicious or not:

  1. Naming: the choice of words/letters used in the FQDN provides important information, such as: (a) whether the domain name is the product of a domain-generation algorithm (DGA) or was manually created; (b) what the domain was created for (simple news, blogs, presentation sites, adult content, phishing, malware, freemium, ransomware).
  2. Meta information: indicators such as Google PageRank score, number of subdomains, geographic coverage, etc. reflect the legitimacy of a domain.
  3. Access profiles: trends in global access to the domain, and specifically gaps or newly created domains with high-frequency access, are extremely useful in the analysis.

Note: This type of approach is not new. However, by building our own system, we were able to focus on potential threats that are more pertinent to Adobe. We carefully selected the training data and included everything in our internal Threat Intelligence Platform. 

After experimenting with several machine-learning methods and techniques, we decided to use (a) a random forest for both the meta-information and access-profile classifiers and (b) a character-level unidirectional Long Short-Term Memory (LSTM) network for the domain-name classifier.

We experimented with several normalization methods for the meta and access profile classifiers, but the best results were obtained with raw values. For the domain-name classifier, the best results were obtained by feeding the characters in reverse order and using the last cell state for classification, while also back-propagating an auxiliary loss for the subclass, with the UNK labels masked out.
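
To make the input ordering and the masked auxiliary loss more concrete, here is a framework-free sketch. It deliberately leaves out the LSTM itself and works on already-computed class probabilities. The character vocabulary, the `UNK_SUBCLASS` marker, the padding scheme, and the `aux_weight` value are assumptions made for illustration, not details taken from our model.

```python
import math
import string

# Assumed vocabulary: lowercase letters, digits, '-' and '.'.
# Index 0 is padding and index 1 stands for any other character.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
PAD, OOV = 0, 1
UNK_SUBCLASS = -1  # hypothetical marker for examples without a subclass label


def encode_reversed(domain, max_len=16):
    """Map a domain to character ids in reverse order, right-padded with PAD."""
    ids = [CHAR_TO_ID.get(ch, OOV) for ch in reversed(domain.lower())]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


def combined_loss(main_probs, main_targets, sub_probs, sub_targets, aux_weight=0.5):
    """Mean main loss plus a weighted auxiliary loss that skips UNK subclasses."""
    main = sum(cross_entropy(p, t) for p, t in zip(main_probs, main_targets))
    main /= len(main_targets)
    # Only examples with a known subclass contribute to the auxiliary term.
    labeled = [(p, t) for p, t in zip(sub_probs, sub_targets) if t != UNK_SUBCLASS]
    aux = sum(cross_entropy(p, t) for p, t in labeled) / len(labeled) if labeled else 0.0
    return main + aux_weight * aux


ids = encode_reversed("ab.com", max_len=8)  # characters are read as "moc.ba"

loss = combined_loss(
    main_probs=[[0.5, 0.5], [0.5, 0.5]],
    main_targets=[0, 1],
    sub_probs=[[0.25, 0.75], [0.5, 0.5]],
    sub_targets=[1, UNK_SUBCLASS],  # the second example has no subclass label
)
```

In a real training loop the same mask would be applied to the per-example auxiliary losses before averaging, so that UNK examples still train the main malicious/benign output.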

At least for two of the classifiers, the results on the development set look promising. However, we are talking about synthetically generated data. This poses two issues:

  1. The distribution of malicious/benign examples in the dataset does not reflect that of real data, thus the results are biased;
  2. We don’t know how data from external sources was collected by the third parties. They could have used some heuristics in their search, and this could lead to our own classifiers simply picking up the same heuristics, which would likely produce an inflated accuracy figure on the development set (since it shares the same traits as the training dataset).
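
The first issue can be illustrated with a short calculation based on Bayes' rule. The detection rate, false-positive rate, and prevalence values below are invented purely for the example and are not measurements from our classifiers.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by Bayes' rule for a given TPR, FPR and malicious share."""
    true_alarms = tpr * prevalence
    false_alarms = fpr * (1.0 - prevalence)
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical classifier: catches 95% of malicious domains, flags 5% of benign ones.
balanced = precision_at_prevalence(tpr=0.95, fpr=0.05, prevalence=0.5)
skewed = precision_at_prevalence(tpr=0.95, fpr=0.05, prevalence=0.01)
```

With these made-up numbers, the same classifier has a precision of about 0.95 on a balanced dataset but only about 0.16 when 1% of the domains are malicious, which is why figures measured on a roughly balanced development set should not be read as real-world performance.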

The best thing is to assess how the classifiers react to real-life data. However, there is no source of truth for this data, which makes it impossible to compute any accuracy or F-Score automatically. To give some insights, Table 2 below shows the share of newly observed domains that triggered an alarm in a single day, in the following scenarios: (a) each individual classifier used on its own; (b) each pair of classifiers agreeing on the verdict; and (c) all three classifiers unanimously agreeing on the verdict.
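
The three scenarios can be computed from per-domain verdicts as in the sketch below. It assumes each classifier outputs a boolean verdict per domain and that "agreeing" means every classifier in the combination flags the domain as malicious. The sample verdicts are invented and unrelated to the figures in Table 2.

```python
from itertools import combinations


def detection_rates(verdicts):
    """Percent of domains flagged by every classifier in each combination.

    verdicts maps a domain to a dict of classifier name -> bool (True = malicious).
    """
    names = sorted(next(iter(verdicts.values())))
    total = len(verdicts)
    rates = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # A domain counts only if all classifiers in the combination flag it.
            flagged = sum(all(v[name] for name in combo) for v in verdicts.values())
            rates["+".join(combo)] = 100.0 * flagged / total
    return rates


# Invented verdicts for four domains.
sample = {
    "d1.example": {"naming": True, "meta": True, "access": True},
    "d2.example": {"naming": True, "meta": True, "access": False},
    "d3.example": {"naming": False, "meta": True, "access": False},
    "d4.example": {"naming": False, "meta": False, "access": False},
}
rates = detection_rates(sample)
```

By construction, a combination can never flag more domains than any of its members, so the unanimous vote always yields the smallest set.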

Setup                  Percent of detections
Naming                 23.10 %
Meta                   31.51 %
Access                 16.29 %
Naming+Meta            15.16 %
Naming+Access           8.30 %
Meta+Access            12.13 %
Naming+Meta+Access      6.54 %

Obviously, the smallest number of artifacts is generated by the unanimous vote of the three classifiers. While this is still a large number to investigate manually, it is far more manageable for sandbox testing APIs and does not exceed the daily quota. By putting the “Naming+Meta+Access” domains through the sandbox testing service, we came up with a short list of 11 domains, which were manually checked.

We have begun employing these techniques to evaluate external resources and their relative threat level much more efficiently. This information, combined with intelligence from our other efforts such as Tripod, continues to improve the robustness of our overall threat intelligence modeling.

Tiberiu Boros
Data Scientist/Machine Learning Engineer

Andrei Cotaie
Senior Security Engineer

Kumar Vikramjeet
Security Engineer

Major Initiatives, Ongoing Research, Security Automation

Posted on 05-07-2020