Leveraging Graphs to Improve Security Automation and Analysis

Major Initiatives, Ongoing Research, Security Automation

(Peleus is one of our Adobe Security Training and Advancement Program Black Belts – the highest security achievement at Adobe. It reflects thousands of hours of applying security principles, learning, and knowledge sharing. Black Belts partner with the Security organization as strong advocates for and drivers of security culture at Adobe.)

In my last blog, I gave the background for a research project where I am using graph databases to create graphs of application metadata to improve the efficiency of security automation.  In this blog, we will look at a few theoretical graphs to show their potential value. Essentially, what we are building in this series is a social network-style graph for your applications. Each application is a “profile” where you record its metadata and its connections throughout your organization. By interconnecting all of the application security and service-related data from across your organization, you can obtain far greater context regarding potential security risks than is possible with many existing techniques.  

Many social media platforms use a graph database as their back end, since it is the most natural representation for what they are trying to achieve. People on social media networks are typically represented as nodes, and their likes and interests are represented as nodes as well. The graph uses labeled edges to express the context of the connections between people and their interests. Edges can even carry weights to express the strength of those connections.
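As a minimal sketch of this idea, a labeled, weighted graph can be modeled with plain Python dictionaries. All node names and labels below are hypothetical, chosen only to illustrate how context attaches to edges:

```python
from collections import defaultdict

class LabeledGraph:
    """A tiny directed graph where each edge carries a label and a weight."""

    def __init__(self):
        # adjacency map: source node -> list of (target, label, weight)
        self.edges = defaultdict(list)

    def add_edge(self, source, target, label, weight=1.0):
        self.edges[source].append((target, label, weight))

    def neighbors(self, source, label=None):
        """Return targets reachable from source, optionally filtered by edge label."""
        return [t for (t, lbl, _w) in self.edges[source] if label is None or lbl == label]

g = LabeledGraph()
g.add_edge("Alice", "Photography", "interested_in", weight=0.9)
g.add_edge("Alice", "Bob", "friends_with", weight=0.5)
g.add_edge("Bob", "Photography", "interested_in", weight=0.2)

print(g.neighbors("Alice", label="interested_in"))  # ['Photography']
```

A production system would use a real graph database rather than in-memory dictionaries, but the shape of the data – nodes connected by labeled, weighted edges – is the same.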

(Image: example social network graph, licensed under CC BY-SA 3.0 – https://creativecommons.org/licenses/by-sa/3.0/)

These graphs can allow social media platforms to create a contextual view of a person by interconnecting the different parts of their life. We can use this same approach to create a better contextual understanding of our services at both the network and application layer. By having the full context, we can make better judgments in both automated and human review.

For instance, most firewall rules are just a list of IPs. If you work with a single network, then you probably have your network’s IPs memorized. However, if you work in a central security team with multiple networks, then that becomes difficult.

Therefore, as a hypothetical example, a review of firewall rules might give you only the following limited information regarding the allowed service-to-service communication between two public services:

    1.2.3.4 makes egress port 80 connections to 5.6.7.8

The use of port 80 (unencrypted HTTP) is a bad security practice but, without context, it is hard to know the exact severity. Instead, it would be more useful to have this information during a proactive review:

    1.2.3.4 (Image Processing Service) makes egress port 80 connections to 5.6.7.8 (auth.foo.com)

With this added context, it is now clear that the Image Processing Service is making an insecure network connection to an authentication service, which would be a significant issue. Obviously, you wouldn’t run an auth service on port 80, but this hypothetical example illustrates the point. Knowing who and what those IPs represent gives you the context needed to make more informed decisions.

Obtaining that context requires mapping the IP to the account ID, domain name, and project. In addition, you would want to be able to map the project to an owner for quick remediation. If you work solely within a single account within a single cloud provider, such as AWS, then you may be able to get this information from Route53, AWS tags, and a collection of API queries. However, more complex environments will have multiple account IDs and may also be deployed across multiple cloud providers. This means that you must collect the information from multiple sources, and this is where creating a graph can help you create those links. A graph would allow you to build a model that might look like the image below:
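To make this concrete, here is a hedged sketch of stitching that context together. The records, node identifiers, and field values below are invented for illustration; real inventories (DNS records, cloud APIs, ownership databases) would feed in their own formats:

```python
from collections import defaultdict

# adjacency map: node -> set of (relation, linked node)
edges = defaultdict(set)

def link(a, b, relation):
    """Record an undirected, labeled connection between two nodes."""
    edges[a].add((relation, b))
    edges[b].add((relation, a))

# Source 1: cloud inventory (IP -> account ID and project)
link("ip:1.2.3.4", "account:111122223333", "in_account")
link("ip:1.2.3.4", "project:image-processing", "belongs_to")

# Source 2: DNS records (IP -> domain name)
link("ip:5.6.7.8", "domain:auth.foo.com", "resolves_to")

# Source 3: ownership data (project -> owner, for quick remediation)
link("project:image-processing", "owner:jdoe@example.com", "owned_by")

def describe(node):
    """Collect everything directly linked to a node, for reviewer context."""
    return sorted(target for _relation, target in edges[node])

print(describe("ip:1.2.3.4"))  # ['account:111122223333', 'project:image-processing']
```

Each source contributes only the links it knows about; the graph accumulates them into a single connected picture of the IP.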

Similarly, when doing an application layer security review, it would be ideal to have the complete context of an application in order to view it from a “single pane of glass”. This requires the ability to make connections between third-party library trackers, static analysis tools, JIRA entries, GitHub projects, cloud providers, and several other data sources. A graph representation of the relations between these data sources might look like the image below:

In these sample graphs, the edge labels, weights, and directions have been excluded for simplicity. The important takeaway is that you can easily traverse from any point on the graph to any other point. Therefore, if you want to dig into a finding from a static analysis tool, you can link that finding to the IP and domain where it is hosted for further testing, as well as to the JIRA project where you will need to file any potential bugs. In the bottom left-hand corner, you can see that it is also possible to link this initial project with other projects that share matching properties. These types of connections would allow you to perform a breadth-based search across multiple projects based on the GitHub organization, cost center, or management contact. Graphs provide multiple options for contextualizing data from the perspective of any given data point.
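That breadth-based pivot can be sketched in a few lines: starting from one project, find every other project that shares at least one property node. The project and property names here are hypothetical:

```python
# Each project maps to the set of property nodes it is connected to
# (GitHub organization, cost center, management contact, etc.).
graph = {
    "project:alpha": {"org:acme-images", "costcenter:42"},
    "project:beta":  {"org:acme-images"},
    "project:gamma": {"costcenter:42", "manager:jdoe"},
    "project:delta": {"manager:someone-else"},
}

def related_projects(project):
    """Return other projects sharing at least one property with this one."""
    props = graph[project]
    return sorted(p for p, p_props in graph.items() if p != project and props & p_props)

print(related_projects("project:alpha"))  # ['project:beta', 'project:gamma']
```

In a real graph database this would be a two-hop traversal (project → shared property → project) rather than a set intersection, but the result is the same breadth-based view.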

Another important advantage of graphs is that building one doesn’t require you to have all your information neatly pre-organized. Many of the tools that contain the information used to build Adobe’s graphs were not designed to be cross-referenced. Instead, the graph is built like a puzzle. You start by creating small sub-components of the graph from the individual tools. At first, you are not quite sure how each piece will fit together with all the others. However, you keep connecting smaller structures into a larger structure whenever you find a common property, until the final picture emerges.

For instance, perhaps the project name someone entered into the static analysis tool doesn’t exactly match any of the names found in GitHub. It is not uncommon for someone to use an internal code name for a project in one tool and the public name of the project in another. This makes it impossible for code to link the two together based on their names. However, if both data sources record the same GitHub URL, then you will be able to connect the two sources in the graph.
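A small sketch of that join, using invented records: the two tools disagree on the project name, but both carry the same GitHub URL, so the URL becomes the key that merges them into one node:

```python
# Hypothetical exports from two tools that were never designed to be cross-referenced.
sast_projects = [
    {"name": "internal-codename-x", "repo": "https://github.com/acme/image-svc"},
]
github_projects = [
    {"name": "image-svc", "repo": "https://github.com/acme/image-svc"},
]

# Merge records that share a repo URL into a single graph node.
by_repo = {}
for rec in sast_projects:
    by_repo.setdefault(rec["repo"], {})["sast_name"] = rec["name"]
for rec in github_projects:
    by_repo.setdefault(rec["repo"], {})["github_name"] = rec["name"]

# The names differ, but the shared URL links the two sources.
print(by_repo["https://github.com/acme/image-svc"])
```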

Chances are there will be some amount of redundancy, which means that if you couldn’t immediately link something based on one data source, you will be able to link it later using another data source. For instance, in the diagram above, there are two connections going into the domain name node. However, let’s say that for a given project, it was not possible to tie the project name directly to the domain name. That would be fine so long as a secondary path exists, such as using the IP address to bridge the AWS account ID and the domain name. The more data sources you add, the more links you will be able to find, which creates redundancy.

This gives you the freedom not to worry about whether something connects at any individual stage of the process. Instead, you just keep adding data from different sources and adding links as you find them. You only worry about the final graph once all the data is added into the database.

Once the graph is created, any association that you can create in your mind by looking at a picture of the graph, you can also make via code. Using a graph query language, you can ask the question, “What is the domain name that is associated with this static analysis project?”  The process will start at the node for the static analysis tool’s project and walk the graph until it finds the corresponding domain name node. These queries will not wander off into other projects if you were careful about the direction of your edges when creating the graph. The details of graph design will be discussed later in this series. The overall point is that a single query can make a dynamic correlation that might otherwise have taken multiple cross-references in a more traditional approach.
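As a sketch of that kind of query, the walk from a static analysis project to its domain name can be expressed as a breadth-first search over directed edges. A real deployment would use a graph query language such as Gremlin or Cypher; the node identifiers and edges below are hypothetical:

```python
from collections import deque

# Directed edges point outward from the project, so a walk starting at the
# project cannot wander backward into unrelated projects.
edges = {
    "sast:proj-123": ["repo:github.com/acme/image-svc", "aws:111122223333"],
    "aws:111122223333": ["ip:1.2.3.4"],
    "ip:1.2.3.4": ["domain:images.foo.com"],
}

def find_first(start, prefix):
    """Walk the graph breadth-first from start; return the first node whose id begins with prefix."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node.startswith(prefix):
            return node
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # no matching node reachable from start

print(find_first("sast:proj-123", "domain:"))  # domain:images.foo.com
```

A single call answers the question even though the project and the domain were never directly linked by any one data source; the walk simply follows the chain of intermediate nodes.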

This makes graphs very powerful for both manual review and automation. With automation, you can use the data from your static analysis tool to trigger dynamic testing against domains in a targeted fashion. If you are performing a manual review, contextual information can allow you to work more efficiently as shown with the network example. In addition, by having all your security tool references as members of the graph, it is possible to create a “single pane of glass” UI for viewing all the relevant security information for a given project. You can examine the basics at a high level and then deep dive into respective tools using the reference recorded in the graph.

Overall, graphs allow you to perform your analysis, either manually or through automated tools, with more efficiency and greater context. Rather than having security information dispersed across your organization in isolated silos, you can connect all of it into complete application profiles that provide the necessary information for a complete beginning-to-end workflow for analyzing and triaging issues. In my upcoming blogs, I will detail how you can build and query these graphs within your organization.

Peleus Uhley
Principal Scientist



Posted on 06-23-2020