Correlation Defined: Solving Mysteries
I’m going to get a bit nerdy right off the bat this time around. I generally wait until the second or third paragraph before I let my nerd flag fly, but this time there are some things I need to define for the average reader. The topic (as you can see by the title) is correlation. In statistics, two measures can be positively or negatively correlated. A positive correlation means that as one measure increases, the other tends to increase as well. For example, let’s say we have a rocket on the Moon, and for some unknown reason, we decide we are going to go for a ride into the Sun. As our distance from the Moon increases, presumably the heat effects of the Sun will also increase. As one measure (distance from the Moon) increases, another (heat from the Sun) follows suit. There is a direct, positive correlation.
Negative correlations are often useful for spotting anomalous indicators. A negative correlation means that as one of two values increases, the other tends to decrease. For instance, let’s look at your common office printer. Every office manager knows that as you print more pages, the amount of toner decreases. So you have one piece of the puzzle growing in number while the other decreases: a negative correlation. How strong that negative correlation is has yet to be determined, though. There are other variables in play, such as color versus black ink and the number of pages.
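To put a number on "positive" and "negative," statisticians usually compute the Pearson correlation coefficient, r, which runs from -1 (perfectly negative) to +1 (perfectly positive). What follows is a minimal sketch of that calculation, not anything taken from a real printer or rocket: the function name pearson_r and every figure in the lists are made up purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient (r) of two sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Made-up numbers for illustration only (arbitrary units).
distance_from_moon = [0, 10, 20, 30, 40]
heat_from_sun = [1.0, 1.2, 1.5, 1.9, 2.4]

pages_printed = [0, 500, 1000, 1500, 2000, 2500]
toner_percent = [100, 91, 79, 72, 58, 51]

r_rocket = pearson_r(distance_from_moon, heat_from_sun)
r_printer = pearson_r(pages_printed, toner_percent)

print(f"rocket:  r = {r_rocket:+.3f}")   # positive: both rise together
print(f"printer: r = {r_printer:+.3f}")  # negative: one rises, one falls
```

The sign of r tells you the direction of the relationship, and how close it sits to -1 or +1 tells you how tightly the two values move together.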
Both positive and negative correlations can look a lot like cause and effect, but correlation on its own does not establish causation. Causality is a direct action producing a direct result. If I kick a football, it flies away from my foot; that is cause and effect. There’s no room for interpretation. Now, it could be said that there are a lot more data related to the event (and there are), but the action and outcome are directly cause and effect.
And here’s where we get to the ambiguous subtitle. Recently, while using a trending tool available online, I looked up two seemingly unrelated topics: knitting and poison. It became immediately obvious there was a negative correlation between the two. The searches for knitting tapered off as the searches for poison increased at almost exactly the same rate. Either crafty folks and grandmothers alike were giving up knitting in favor of poisoning their neighbors (with elderberry wine) or there was some other underlying factor. With a family reunion on the horizon, I had no other choice but to investigate further. It was a matter of survival, you see. Luckily, I work where I do, which means I have some tools available to me for just this sort of problem.
As an analyst working with “big data,” I treat correlating metrics as an exploratory process that helps uncover unknown relationships. The dimensions can be date, visitor ID, geography, or any number of other applicable dimensions. Ultimately, you can correlate against any dimension type, but as a best practice and a common statistical rule of thumb, you need at least 30 observations, instances, or values. Keeping this in mind, I took on the mystery of why Nana was putting down her knitting needles and buying strychnine. I plugged my variables into Adobe Analytics Premium. The color-coded correlation matrix gave me visual points of reference, and I was able to come up with an answer efficiently.
How efficiently? That is the question. The table below has 16 metrics based on data from a Web service. Remembering our “30 rule,” these metrics were trended over 35 days. Which metrics have the strongest positive correlation with “visits”? It is almost impossible to tell with the naked eye. Our brains are just not designed to absorb all this data in the manner it is presented, at least not when we’re evaluating numerical values on the scale “big data” requires.
However, when we open up a correlation matrix, we can see the relationships both numerically and as a color-coded heat map. This makes it easier to identify the metrics with the strongest correlation. Our brains are able to make the connection much more quickly than when we’re looking at just three colors (the black, white, and one shade of blue in the table above). As it turns out, visits and single access have the strongest positive correlation (r = 0.983) in this matrix.
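For readers who want to see what sits behind a correlation matrix, here is a rough sketch in plain Python. It is not how Adobe Analytics Premium computes its matrix, and it does not use the data from the table: the four metric names, the 35 days of values, and the helper names correlation_matrix and strongest_partner are all invented for illustration. The only things borrowed from the text are the “30 rule” and the question being asked, namely which metric moves most closely with visits.

```python
import math
import random

MIN_OBSERVATIONS = 30  # the "30 rule" rule of thumb from the text


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient (r) of two sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(metrics):
    """Return {metric_a: {metric_b: r}} for every pair of metrics."""
    names = list(metrics)
    n_obs = len(metrics[names[0]])
    if n_obs < MIN_OBSERVATIONS:
        raise ValueError(f"only {n_obs} observations; want {MIN_OBSERVATIONS}+")
    return {a: {b: pearson_r(metrics[a], metrics[b]) for b in names}
            for a in names}


def strongest_partner(matrix, target):
    """Return (name, r) of the metric most positively correlated with target."""
    others = {name: r for name, r in matrix[target].items() if name != target}
    best = max(others, key=others.get)
    return best, others[best]


# Synthetic data: 35 days of made-up metrics, seeded so runs are repeatable.
rng = random.Random(42)
days = 35
visits = [rng.gauss(1000, 100) for _ in range(days)]
metrics = {
    "visits": visits,
    "single_access": [0.4 * v + rng.gauss(0, 5) for v in visits],   # tight link
    "page_views": [3.0 * v + rng.gauss(0, 300) for v in visits],    # looser link
    "orders": [rng.gauss(50, 10) for _ in range(days)],             # unrelated
}

matrix = correlation_matrix(metrics)
best_name, best_r = strongest_partner(matrix, "visits")

# Print the "visits" row of the matrix, then the winner.
for name, r in matrix["visits"].items():
    print(f"visits vs {name:<14} r = {r:+.3f}")
print(f"strongest positive partner: {best_name} (r = {best_r:.3f})")
```

A heat map is simply this same grid of r values with a color assigned to each cell, which is why the strongest relationships jump out so much faster than they do in a table of raw numbers.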
So back to Granny and our knitting/poison mystery. As it turns out, the time of year was the factor. During the winter months, when people are indoors more, knitting was searched for more often. When the weather began to warm up, the searches for poison increased, but the searches for knitting dropped off. Likely, as pests and weeds began to take hold, people started looking for ways to eradicate them. Conversely, since the weather was warmer, there was less interest in knitting sweaters and blankets.
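The same situation can be sketched in code: two series that never influence each other but both follow the seasons. One common way to check for a shared driver like this is a partial correlation, which asks how strongly two values move together once the linear effect of a third has been removed. This is an illustration under stated assumptions, not the analysis actually run for this post: the weekly search volumes, the warmth curve, and the helper name partial_r are all hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient (r) of two sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_r(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Two years of made-up weekly data, seeded so runs are repeatable.
rng = random.Random(7)
weeks = range(104)

# Hypothetical "warmth" index: lowest in midwinter (week 0), highest midyear.
warmth = [-math.cos(2 * math.pi * w / 52) for w in weeks]

# Each series depends only on warmth plus its own independent noise.
knitting = [60 - 30 * t + rng.gauss(0, 5) for t in warmth]  # falls as it warms
poison = [40 + 25 * t + rng.gauss(0, 5) for t in warmth]    # rises as it warms

raw = pearson_r(knitting, poison)
controlled = partial_r(knitting, poison, warmth)

print(f"knitting vs poison:             r = {raw:+.3f}")
print(f"same, controlling for warmth:   r = {controlled:+.3f}")
```

In data built this way, the raw correlation is strongly negative even though neither search term affects the other, and most of that relationship disappears once the season is taken into account. Real search data would be messier, but the logic is the same.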
Once I was satisfied that this was indeed the cause, I packed for the reunion, secure that I could again enjoy Aunt Marie’s green beans wrapped in bacon. If I had based my decision about whether to go to the reunion solely on the initial data, I would have opted out of my family’s event entirely.
Understanding diagnostic analytics and being able to identify correlations (both positive and negative) correctly gives you an invaluable source of information, but it’s not the only tool to help ferret out anomalous data. Next week we’ll look at Anscombe’s Quartet and show how a visual representation of values can help you more accurately identify anomalies before they can become impactful. Watch out for “Anscombe’s Quartet,” or “Not All Numbers Were Created Equal.”