In dealing with the best ways to change a site to maximize ROI, one of the most common refrains I hear is “is the change statistically confident” or “what is the confidence interval”, which often leads to a long discussion about what those measures really mean. One of the funniest things in our industry is the over-reliance on statistical measures to prove that someone is “right”. Whether it is a Z-score, t-test, chi-squared, or some other measure, people love to throw them out and use them as the end-all be-all of confirmation that they and they alone are correct. Reliance on any one tool, discipline, or action to “prove” value does nothing to improve performance or to allow you to make better decisions. These statistical measures can be extremely valuable when used in the right context and without blind reliance on them to answer any and all questions.
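To be concrete about what these measures actually compute, here is a rough sketch of the pooled two-proportion z-test that sits behind most testing tools' “confidence” number. The function name and the visitor/conversion counts are my own illustration, not any vendor's exact formula:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test for recipe B vs recipe A.

    Returns the z statistic and the one-sided confidence that B's
    underlying conversion rate beats A's. Illustrative sketch only.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    confidence = 0.5 * (1 + erf(z / sqrt(2)))  # normal CDF evaluated at z
    return z, confidence

# Made-up numbers: 5,000 visitors per recipe, 2.0% vs 2.5% conversion.
z, conf = two_proportion_z(100, 5000, 125, 5000)
```

Note what this number is and is not: it is a statement about the observed difference under a set of assumptions (a representative sample, stable underlying rates), which is exactly where real-world data lets it down.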

Confidence-based calculations are often used in a way that leaves them the least effective way to measure true change and the importance of data (or “who is correct”) when they are applied to real-world situations. They work great in a controlled setting and with infinite data, but in the real world they are just one of many imperfect standards for measuring the impact of data and changes. Real-world data distribution, especially over any short period of time, rarely resembles a normal distribution. You are also trying to account for distinct groups with differing propensities of action, instead of trying to account for one larger representative population. What is also important to note is that even in the best-case scenario, these measures only work if you have a representative data set, meaning that just a few hours or even a couple of days of data will never be representative (unless your Tuesday morning visitors are identical to your Saturday afternoon visitors). What you are left with is your choice of many imperfect measures which are useful, but are not meaningful enough to be the only tool you use to make decisions.

What is even worse is that people also try to use this value as a predictor of outcome, so they say things like “I am 95% confident that I will get 12% lift.” These measures only measure the likelihood of the pattern of outcome, so you can say, “I am 95% confident that B will be better than A,” but they are not measures of the scale of outcome, only the pattern.
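A quick simulation makes the pattern-versus-scale point. Even when B genuinely is about 12% better than A, the lift any single test reports swings wildly around that number. The rates and sample sizes below are made up for illustration:

```python
import random
from statistics import mean

random.seed(7)

# Assumed true rates: A converts at 2.00%, B at 2.24% -- a real 12% lift.
def observed_lift(n=5000):
    """Run one simulated test with n visitors per recipe and return
    the relative lift that this single experiment would report."""
    a = sum(random.random() < 0.0200 for _ in range(n))
    b = sum(random.random() < 0.0224 for _ in range(n))
    return (b - a) / a

lifts = [observed_lift() for _ in range(200)]
```

Across 200 simulated tests the average reported lift sits near the true 12%, but individual tests report everything from outright losses to wins several times the true size. Confidence can tell you B probably beats A; it cannot promise you the 12%.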

It is like someone found this new fancy tool and suddenly has to apply it because they realize that what they were previously doing was wrong, but now this one thing will suddenly make them perfect. Like any tool at your disposal, there is a lot of value when it is used correctly and with the right amount of discipline. When you are not disciplined in how you evaluate data, you will never really understand it and use it to make good decisions.

So if you cannot rely on confidence alone, how do you best determine whether you should act on data? Here are three really simple steps to measure the impact of changes when evaluating causal data sets:

1) Look at performance over time – Look at the graph, look for consistency of data, and look for a lack of inflection points (comparative analysis). Make sure you have at least 1 week of consistent data (that is not the same as just one week of data). You cannot replace understanding patterns, looking at the data, and understanding its meaning. Nothing can replace the value of just eyeballing your data to make sure you are not getting spiked on a single day and that your data is consistent. This human-level check gives you the context that helps correct against so many of the imperfections that just looking at the end numbers leaves you open to.
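A trivial sanity check along these lines, assuming you can pull daily conversion rates, is to flag any day that deviates sharply from the period's average. The 25% tolerance below is an arbitrary placeholder, not a recommendation — it is no substitute for actually looking at the graph:

```python
from statistics import mean

def flag_inconsistent_days(daily_rates, tolerance=0.25):
    """Return the indexes of days whose conversion rate deviates from
    the period average by more than `tolerance` (relative). The 25%
    default is an arbitrary placeholder -- tune it to your own site."""
    avg = mean(daily_rates)
    return [i for i, r in enumerate(daily_rates)
            if abs(r - avg) / avg > tolerance]

# A made-up week of daily conversion rates with a spike on day 4.
week = [0.021, 0.020, 0.022, 0.019, 0.035, 0.021, 0.020]
```

Here `flag_inconsistent_days(week)` flags day 4 and nothing else — exactly the kind of single-day spike that can quietly inflate an end-of-test number.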

2) Make sure you have enough data – The amount needed changes by site. For some sites, 1,000 conversions per recipe is not enough; for others, 100 per recipe is. Understand your site and your data flow. I cannot stress enough that data without context is not valuable. You can get 99% confidence on 3 conversions over 1, but that doesn’t make it valuable or the data actionable.
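To see why 3-conversions-over-1 is not actionable no matter what a confidence number says, look at how wide the interval on the rate difference is at counts that small. This is a crude normal-approximation sketch, and the 100 visitors per recipe is an assumed number:

```python
from math import sqrt

def diff_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Rough 95% interval on the difference in conversion rates
    (normal approximation -- crude at counts this small, which is
    exactly the point)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    d = p_b - p_a
    return d - z * se, d + z * se

# Assumed 100 visitors per recipe: 1 conversion on A, 3 on B.
lo, hi = diff_interval(1, 100, 3, 100)
```

B shows triple A's conversions, yet the interval still comfortably straddles zero: the data cannot distinguish a win from a loss, regardless of how the headline confidence reads.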

3) Make sure you have meaningful differentiation – Make sure you know what the natural variance is for your site (in a visitor-based metric system, it is pretty regularly around 2% after a week). There are many easy ways to figure out what it is for the context of what you are doing. You can be 99% confident at a .5% lift, and I will tell you that you have nothing (neutral). You can have 3% lift and 80% confidence, and if it is over a consistent week and your natural variance is below 3%, I will tell you that you have a decent win.
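The rule of thumb above can be sketched as a tiny decision helper. The 2% natural-variance default and the 80% confidence bar mirror the examples in this step, but they are illustrative thresholds, not universal constants — measure your own site's variance first:

```python
def verdict(lift, confidence, natural_variance=0.02):
    """Tiny decision helper. The 2% natural-variance default and the
    80% confidence bar are illustrative thresholds, not universal
    constants -- measure your own site's variance before using them."""
    if abs(lift) <= natural_variance:
        return "neutral"  # inside normal noise, regardless of confidence
    if confidence >= 0.80:
        return "decent win" if lift > 0 else "decent loss"
    return "keep collecting data"

# 99% confident at 0.5% lift -> still inside natural variance.
verdict(0.005, 0.99)   # "neutral"
# 3% lift at 80% confidence, variance at 2% -> worth acting on.
verdict(0.03, 0.80)    # "decent win"
```

The check on lift-versus-variance comes first deliberately: no amount of confidence rescues a difference that your site produces on its own.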

I have gotten into many debates with statisticians about whether confidence provides any value at all in the context of online testing, and my usual answer is that if you understand what it means, it can be a great barometer and another fail-safe that you are making a sound decision. The failure is that you can’t just use it as the only tool in your arsenal. I am not saying that there is not a lot of value in P-value based calculations, or most statistical models. I will stress, however, that they are neither a panacea nor an excuse for not doing active work to understand and act on your data. You have to be willing to let the data dictate what is right, and that means you must be willing to understand the disciplines of using the data itself.