The study of history teaches you many key lessons, such as the unreliability of the first-person narrative, the inability to understand the scope of something while it is happening, and most importantly that history is written by the victors. Howard Zinn made a living out of showing people just how true this is and how little we understand our own world because of it. The same mistake is made not just in historical analysis but all the time in data analysis, when we look only at attributes of the “winning” side and forget to analyze the non-winning side of the data. While closely related to the Halo Effect, this effect shows itself in what people like deGrasse Tyson and Taleb refer to as the graveyard of knowledge.

Neil deGrasse Tyson best explains the graveyard of knowledge with one of his stories:

You read a study that says that 80% of people who survived a plane crash studied the exit routes before the plane took off. Comfortable with this knowledge, the next time you board a plane, you quickly study the exit routes. As you do this, you start to analyze that data and come to a sudden realization: what if 100% of the people who did not survive the crash also studied the exit routes?

We don’t know the other side of the story, because there is no one there to report it. We only know the people who came back and what they can tell us; we have no clue what was going on with those who did not come back. Knowledge is lost all the time when people look only at the winners. Winners are the only ones left to tell their stories, so we look only to them for details. The reality is that the important details are rarely found only on the winning side, and all the people who never returned hold knowledge that is just as vital. We lose that information in the graveyard of the people who never returned. In fact, we can only start to understand anything if we have both pieces, because only then do we have context for our information. We focus so much on those who “survived” that we completely ignore the context of the people who did not. We look only at the behaviors we want and then extract qualities about that group of people, without looking at the population as a whole or, more importantly, at what would have happened if we had done nothing.
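To make the plane example concrete, here is a minimal Python sketch. Every count in it is invented for illustration and does not come from any real study; the function name `survival_rate_by_group` is likewise hypothetical. It shows that the same "80% of survivors studied the exits" headline is consistent with studying the exits looking harmful or helpful, depending entirely on what sits in the graveyard.

```python
def survival_rate_by_group(survived_studied, survived_not, died_studied, died_not):
    """Return (P(survive | studied exits), P(survive | did not study exits))."""
    studied_total = survived_studied + died_studied
    not_studied_total = survived_not + died_not
    return (survived_studied / studied_total,
            survived_not / not_studied_total)

# Hypothetical survivors: 100 people, 80 of whom studied the exits.
# This is the only part the "study" can ever report.
SURVIVED_STUDIED, SURVIVED_NOT = 80, 20

# Scenario A (assumed): all 400 non-survivors also studied the exits.
scenario_a = survival_rate_by_group(SURVIVED_STUDIED, SURVIVED_NOT,
                                    died_studied=400, died_not=0)

# Scenario B (assumed): only 100 of the 400 non-survivors studied the exits.
scenario_b = survival_rate_by_group(SURVIVED_STUDIED, SURVIVED_NOT,
                                    died_studied=100, died_not=300)

for name, (studied, not_studied) in [("A", scenario_a), ("B", scenario_b)]:
    print(f"Scenario {name}: P(survive | studied) = {studied:.2f}, "
          f"P(survive | not studied) = {not_studied:.2f}")
```

In scenario A, studying the exits is associated with a lower survival rate; in scenario B, with a higher one. The survivor-only statistic is identical in both, which is exactly why it cannot answer the question on its own.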

We group people by the end of their behavior, not by how they were defined beforehand. We love to know the characteristics of people who made a purchase, or of people who come to our site more than 3 times. We look backwards from that winning behavior because that is all we think we have available to us. We love to describe past behavior through correlated behavior, and then attribute “value” to those actions. People who purchase use internal search 2 times on average, therefore internal search must be the cause of that action. People who come from social sources spend $4.56 on average, therefore social is worth $4.56. But we don’t know what would have happened if the same person hadn’t used search or come from social. Would they have spent more or less? All of these types of analysis attribute past behavior to end value, missing the point that we don’t know what people would have done otherwise. Is looking at the exit routes helping or hurting your ability to survive? We don’t know whether more is better; instead we assume a linear relationship: if campaign X is generating value Y, then doubling spend on X will of course generate 2Y.
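A small sketch of the difference between "average spend of people who came from social" and "what social is actually worth" may help here. It assumes you have a comparable holdout group that was not exposed to the channel, which the scenario above does not provide; the spend figures and the function name `incremental_value` are made up for illustration.

```python
from statistics import mean

def incremental_value(exposed_spend, holdout_spend):
    """Average spend of the exposed group minus average spend of a comparable holdout."""
    return mean(exposed_spend) - mean(holdout_spend)

# Hypothetical per-visitor spend in dollars.
exposed = [0.0, 4.0, 9.68]   # visitors who arrived from social, average $4.56
holdout = [0.0, 4.5, 9.0]    # comparable visitors who did not, average $4.50

naive_value = mean(exposed)                 # what the backward-looking report claims
lift = incremental_value(exposed, holdout)  # what the channel plausibly added

print(f"Naive 'value' of social: ${naive_value:.2f}")
print(f"Incremental value:       ${lift:.2f}")
```

With these invented numbers the channel "is worth" $4.56 in the backward-looking report but adds only about six cents per visitor once you compare against what people did without it. The comparison is only meaningful if the two groups really are comparable, for example through random assignment.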

Looking only at the data from one group, or only at the data that defines the “winners,” means that you have completely lost any value from that data. We cannot express how much better or worse an action made things, only that we had X amount of search spend and ended up with Y revenue. Even worse, pretending that you can derive cause and effect without the larger context means that you are not getting value from the actual data itself, but instead propagating your own world view and using the data only to support it. As in the Texas sharpshooter fallacy, you are creating a story to explain what is most likely random noise in the data. Rates of action, such as 80% of people looking at the exit routes, tell you nothing unless you know both that increasing that number increases your ability to survive, and what it costs and how feasible it is to influence people to take that action. I can tell you that 100% of people who are determined to spend $1000 on your site will spend at least $1000, but that doesn’t tell me how to get those people in the first place, or whether it is worth my time to spend resources on that small population as opposed to the multitude of other alternatives.

People make this mistake all the time in the world of data analysis when they get caught up on a set path or on looking backwards from an event. They want to know what all the people who purchased did, or what all the people who come to your site 4 times have in common. There is even a whole world of statistical analysis built around clustering and personas, which is making a large push in our industry and feeds this tendency. The mistake is that looking at only a small part of your population fails to give you the context of that information. As with the plane, knowing the attributes of one group doesn’t tell you the attributes of the population as a whole. Even worse, it assumes that those attributes have anything to do with the behavior. We have no way of knowing if people who survived just happened to look at the exit routes, or if people who look at exit routes are more likely to survive.
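One quick sanity check before reading anything into "what purchasers have in common" is to compare the attribute's share among purchasers with its share among all visitors. The sketch below uses invented counts and a hypothetical helper, `attribute_lift`; a ratio near 1 means the attribute describes everybody, not the winners.

```python
def attribute_lift(group_with_attr, group_size, population_with_attr, population_size):
    """Share of the attribute in the group divided by its share in the whole population."""
    group_share = group_with_attr / group_size
    population_share = population_with_attr / population_size
    return group_share / population_share

# Hypothetical: 180 of 200 purchasers used internal search (90%),
# but so did 9,000 of all 10,000 visitors (also 90%).
lift = attribute_lift(180, 200, 9_000, 10_000)

print(f"Lift of 'used internal search' among purchasers: {lift:.2f}")
```

Even a ratio well above 1 would only show that the attribute is over-represented among purchasers. It would still say nothing about whether the attribute caused the purchase, or about what it would cost to change it.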

In the world of testing, this bias shows up in people who want to know the actions between steps. Of the people who purchased, they want to know whether they went to a product page or a search results page. They want to know what path people took or what they clicked on. Even if this knowledge did not ignore the graveyard of knowledge, what would it tell you? More people went to the search results page; is that a good thing or a bad thing? You are accomplishing nothing with this data except adding cost and slowing down your ability to make the correct decision. It is easy to get lost in the world of data if you are trying to tell a story or to prove a preconceived point, but as soon as you try to use the data to find an answer and not just support your point of view, the discipline of choosing what you look at and knowing what it can tell you becomes paramount.

So the question is, 40 years from now, will all the analysis you do be part of the “winning” group, or will it be lost in the graveyard? Stop pretending that data tells you more than it really does and stop looking only at the winning side, and you will be able to derive orders of magnitude more value from your data. The discipline of looking at the whole context and of discovering the value of actions is what will grant you results, not just finding stories. Remember that patterns are only patterns; they are neither good nor bad. It is incredibly easy to forget that even if they are perfect, they tell you nothing about your ability to change them, or the cost of doing so. Data can be the most powerful tool in your arsenal, but it can also be abused to no end, providing negative value and a blanket justification for poor decisions.