I’m going to get a bit nerdy right off the bat this time around. I generally wait until the second or third paragraph before I let my nerd flag fly, but this time there are some things I need to define for the average reader. The topic (as you can see by the title) is correlative data. In statistics, we have what are called positive and negative correlations. A positive correlation means that as one value increases, another tends to increase along with it. For example, let’s say we have a rocket on the Moon, and for some unknown reason, we decide we are going to go for a ride into the Sun. As our distance from the Moon increases, presumably the heat effects of the Sun will also increase. As one specific piece of data (distance) increases, another (heat from the Sun) follows suit. There is a direct, positive correlation.

Negative correlations are often used to spot anomalous indicators. A negative correlation means that as one of two values increases, the other tends to decrease. For instance, let’s look at your common office printer. Every office manager knows that as you print more pages, the amount of toner decreases. So you have one piece of the puzzle growing in number while the other shrinks: a negative correlation. How strong that negative correlation is has yet to be determined, though. There are other variables in play, such as color versus black ink and the number of pages.
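To put a number on "how strong," analysts usually reach for the Pearson correlation coefficient, r, which runs from -1 (perfectly negative) to +1 (perfectly positive). Here is a minimal sketch of that calculation. The toner and rocket figures are made up purely for illustration, and `pearson_r` is just a helper name I picked, not anything from a particular tool.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers for illustration only (not real measurements).
pages_printed = [0, 500, 1000, 1500, 2000, 2500]
toner_percent = [100, 91, 79, 72, 58, 51]
distance_from_moon = [1, 2, 3, 4, 5, 6]
heat_from_sun = [1.0, 1.3, 1.9, 2.4, 3.1, 3.9]

r_toner = pearson_r(pages_printed, toner_percent)
r_heat = pearson_r(distance_from_moon, heat_from_sun)

print(f"pages vs. toner:   r = {r_toner:.3f}")  # negative: toner falls as pages rise
print(f"distance vs. heat: r = {r_heat:.3f}")   # positive: both rise together
```

The sign tells you the direction of the relationship, and the closer the value sits to -1 or +1, the tighter it is.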

Both positive and negative correlations can look a lot like cause and effect, but a correlation on its own does not establish one. Causality means a direct action produces a direct result. If I kick a football, it flies away from my foot; that is cause and effect. There’s no room for interpretation. Now, it could be said that there are a ton more data related to the event (and there are), but the action and outcome are directly cause and effect.

And here’s where we get to the ambiguous subtitle. Recently, while using a trending tool available online, I looked up two seemingly unrelated topics: knitting and poison. It became immediately obvious there was a negative correlation between the two. The searches for knitting tapered off as the searches for poison increased at almost exactly the same rate. Either crafty folks and grandmothers alike were giving up knitting in favor of poisoning their neighbors (with elderberry wine) or there was some other underlying factor. With a family reunion on the horizon, I had no other choice but to investigate further. It was a matter of survival, you see. Luckily, I work where I do, which means I have some tools available to me for just this sort of problem.

For an analyst working with “big data,” correlating metrics is an exploratory process that helps uncover unknown relationships. The dimensions can be date, visitor ID, geography, or any number of other applicable dimensions. Ultimately, you can correlate against any dimension type, but as a best practice grounded in statistical theory, you want at least 30 observations, instances, or values as a rule of thumb. Keeping this in mind, I took on the mystery of why Nana was putting down her knitting needles and buying strychnine. I plugged my variables into Adobe Analytics Premium. The color-coded correlation matrix gave me visual points of reference, and I was able to come up with an answer efficiently.

How efficiently? That is the question. The table below has 16 metrics based on data from a Web service. In keeping with our “30 rule,” these metrics were trended over 35 days. Which metrics have the strongest positive correlation with “visits”? It is almost impossible to tell with the naked eye. Our brains are just not designed to absorb all this data in the manner in which it is presented, at least not when we’re evaluating numerical values on the scale “big data” requires.

[Image: table of 16 metrics trended over 35 days]

However, when we open up a correlation matrix, we can see the relationships both numerically and through a heat-map color scale. This makes it easier to identify the metrics with the strongest correlation. Our brains make the connection much more quickly than when we’re looking at just three colors (the black, white, and single shade of blue in the table above). As it turns out, visits and single access have the strongest positive correlation (r = 0.983) in this matrix.

[Image: color-coded correlation matrix of the same metrics]
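If you don’t have a tool that draws the matrix for you, the idea behind it is easy to sketch by hand: correlate every metric with every other metric, then rank the ones you care about. The snippet below is only an illustration under my own assumptions. The data is randomly generated rather than taken from the Web service above, and apart from visits and single access the metric names (`page_views`, `orders`) are hypothetical stand-ins.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(42)  # fixed seed so the synthetic data is reproducible
days = 35        # stays above the "30 rule" of thumb
assert days >= 30

# Synthetic daily metrics, invented for this sketch.
visits = [1000 + random.gauss(0, 150) for _ in range(days)]
metrics = {
    "visits": visits,
    "single_access": [0.4 * v + random.gauss(0, 10) for v in visits],  # tightly tied to visits
    "page_views": [3 * v + random.gauss(0, 400) for v in visits],      # loosely tied to visits
    "orders": [random.gauss(50, 10) for _ in range(days)],             # unrelated to visits
}

# Build the full correlation matrix as a nested dict: matrix[a][b] = r(a, b).
names = list(metrics)
matrix = {a: {b: pearson_r(metrics[a], metrics[b]) for b in names} for a in names}

# Rank every other metric by its correlation with visits, strongest positive first.
ranked = sorted(
    (name for name in names if name != "visits"),
    key=lambda name: matrix["visits"][name],
    reverse=True,
)
for name in ranked:
    print(f"visits vs. {name}: r = {matrix['visits'][name]:+.3f}")

strongest = ranked[0]
```

A heat map is just this same matrix with each cell colored by its r value, which is why the strongest relationships jump out so quickly.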

So back to Granny and our knitting/poison mystery. As it turns out, the time of year was the factor. During the winter months, when people are indoors more, searches for knitting were more prominent. When the weather began to warm up, the searches for poison increased, but the knitting searches dropped off. Likely, as pests and weeds began to take hold, people started looking for ways to eradicate them. Conversely, since the weather was warmer, there was less interest in knitting sweaters and blankets.
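One way to check a hunch like this numerically is a partial correlation: measure how knitting and poison move together once the shared seasonal factor has been accounted for. To be clear, this is not the method I used above, just a sketch of a standard technique, and every number in it is simulated. The weekly temperature curve and the search volumes are assumptions invented for the example.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_r(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(7)  # fixed seed so the simulated data is reproducible
weeks = range(52)

# Hypothetical weekly temperature: cold at the start and end of the year, warm mid-year.
temperature = [15 - 15 * math.cos(2 * math.pi * w / 52) for w in weeks]

# Simulated search interest: each series depends on temperature, not on the other.
knitting = [100 - 2.0 * t + random.gauss(0, 5) for t in temperature]
poison = [20 + 1.5 * t + random.gauss(0, 5) for t in temperature]

raw = pearson_r(knitting, poison)
controlled = partial_r(knitting, poison, temperature)

print(f"knitting vs. poison (raw):                     r = {raw:+.3f}")
print(f"knitting vs. poison (temperature controlled):  r = {controlled:+.3f}")
```

In this simulation the raw correlation is strongly negative, but it shrinks toward zero once temperature is controlled for, which is what you would expect if the season is driving both trends and Granny is innocent.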

Once I was satisfied that this was indeed the cause, I packed for the reunion, secure that I could again enjoy Aunt Marie’s green beans wrapped in bacon. If I had based my decision as to whether or not to go to the reunion solely on the initial data provided, I would have opted out of my family’s event entirely.

Understanding diagnostic analytics and being able to interpret correlations (both positive and negative) correctly provides an invaluable source of information, but it’s not the only tool to help ferret out anomalous data. Next week we’ll look at Anscombe’s Quartet and show how a visual representation of values can help you more accurately identify anomalies before they can become impactful. Watch out for “Anscombe’s Quartet,” or “Not All Numbers Were Created Equal.”
