If the N-Armed bandit problem is the core struggle of every testing program, then the normal distribution and the related central limit theorem are the windmill that groups use to justify their attempts to solve it. The central limit theorem is something many people remember from their high school and college days, but very few appreciate where and how it fits into real-world problems. It can be a great tool for acting on data, but it can also be applied blindly, producing poor decisions in the name of being easily actionable. Using any tool well requires understanding both its advantages and its disadvantages; without that context you achieve nothing. With that in mind, I want to present the normal distribution as the next math subject that every tester should understand much better.

The first thing to understand about the normal distribution is that it is only one type of distribution. Sometimes called a Gaussian distribution, the normal distribution is easily identifiable by its bell curve. It arises because of the central limit theorem, which states that the mean of a sufficiently large number of independent random variables, each drawn from the same distribution with finite variance, will approximate a normal distribution. To put it another way, if you take any population of people whose behaviors are independent of each other, the means of unbiased samples of them will eventually converge on this attractor distribution, so that you can measure a mean and a standard deviation. This gives you the familiar giant clumping of data points around the mean, with the density of data falling off in a very predictable way as you move farther and farther from that point. It guarantees that in an unbiased collection done over a long period of time, the sample means will reach a normal distribution, but in any biased or limited data set, you are unlikely to have a perfectly normal distribution.
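To make that convergence concrete, here is a minimal sketch in Python (standard library only; the exponential population and the sample sizes are arbitrary choices for illustration) showing that averages of draws from a heavily skewed distribution still cluster around the population mean in the bell shape the theorem predicts:

```python
import random
import statistics

random.seed(42)

# Draw from a heavily skewed (exponential) population -- individually
# nothing like a bell curve -- then average many independent draws.
def sample_mean(n):
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

# Collect 5,000 sample means, each averaging 100 independent draws.
means = [sample_mean(100) for _ in range(5000)]

# The population has mean 1.0 and standard deviation 1.0, so the
# central limit theorem predicts the sample means cluster near 1.0
# with a spread of roughly 1.0 / sqrt(100) = 0.1.
print(statistics.mean(means), statistics.stdev(means))
```

Note that the theorem is about the distribution of the *means*, not the raw data: the individual exponential draws never look normal, no matter how many you collect.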

The reason we love these distributions is that they are the easiest to understand and come with very common, easy-to-use assumptions built in. Schools start with them because they allow an introduction to statistics and are easy to work with, but just because you are familiar with them does not mean the real world always follows this pattern. We know that over time, if we collect enough data in an unbiased way, we will always reach this type of distribution. It allows us to infer a massive amount of information in a short period of time. We can look at the distribution of people to calculate p-values, we can see where we sit on a continuum, and we can easily group and attack larger populations. It allows us to present and tackle data with a variety of tools and an easy-to-understand structure, freeing us to focus on using the data, not figuring out what tools are even available to us. Because of this, schools spend an inordinate amount of time presenting these problems in classes, without informing people of the many real-world situations where they may not be as actionable.
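As a concrete example of the kind of inference this unlocks, here is a sketch of a two-proportion z-test, the normal-approximation calculation behind the p-values mentioned above. The conversion counts are invented purely for illustration:

```python
import math

# Hypothetical conversion counts for a control and a variant;
# the figures here are made up purely for illustration.
control_conv, control_n = 200, 5000   # 4.0% conversion
variant_conv, variant_n = 250, 5000   # 5.0% conversion

p1 = control_conv / control_n
p2 = variant_conv / variant_n

# Pooled proportion under the null hypothesis of no difference.
pooled = (control_conv + variant_conv) / (control_n + variant_n)
se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))

z = (p2 - p1) / se

# Two-sided p-value from the standard normal CDF -- the step that the
# central limit theorem licenses for large samples.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(z, p_value)
```

Everything here leans on the normality assumption; when the caveats below apply, the tidy p-value at the end inherits all of their problems.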

The problem comes when we force data into this distribution when it does not belong, so that we can make those assumptions and act on a single measure of the “value” of the outcome. When you start trying to apply statistics to data, you must always keep in mind the quote from William W. Watt: “Do not put your faith in what statistics say until you have carefully considered what they do not say.”

There are a number of problems with trying to force real-world data into a normal distribution, especially over any short period of time.

Here is just a quick sample of real-world influences that can cause havoc when trying to apply the central limit theorem:

1) Data has to be representative – a perfect distribution of data for Tuesday at 3am has little bearing on Friday afternoon.

2) Data collection is never unbiased, as you cannot record a negative action in an online context. Equally, you will have different propensities of action from each unique group, with no guarantee of an equal collection of those groups to even things out.

3) We are also stuck with a data set that is constantly shifting and changing, from both internal and external changes over time. The longer we take to gather more data, the less representative the data from the earlier gathering period becomes of current conditions.

4) We have great but not perfect data-capturing methods. We use representations of representations of representations. No matter what data-acquisition technology you use, there are always going to be mechanical issues that add noise on top of the population issues listed above. We need to focus on precision, not become caught in the accuracy trap.

5) We subconsciously bias our data through a number of fallacies, which leads to conclusions that have little bearing on the real world.
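To illustrate the third point, here is a small simulation (hypothetical traffic numbers, standard library only) where the underlying conversion rate drifts upward during collection, so the pooled estimate lags what the site is actually doing now:

```python
import random

random.seed(7)

# Hypothetical drifting conversion rate: the "true" rate climbs from
# 4% to 8% over the collection window (e.g. a seasonal shift).
days = 30
visitors_per_day = 1000
daily_rates = [0.04 + 0.04 * d / (days - 1) for d in range(days)]

# Simulate daily conversion counts from the drifting rate.
conversions = [
    sum(random.random() < rate for _ in range(visitors_per_day))
    for rate in daily_rates
]

pooled_rate = sum(conversions) / (days * visitors_per_day)
last_week_rate = sum(conversions[-7:]) / (7 * visitors_per_day)

# The pooled estimate sits near the window's midpoint (~6%), well
# below the current behavior reflected in the last week (~7.5%).
print(pooled_rate, last_week_rate)
```

The more data you pool across the drift, the more precisely you estimate a rate that no longer describes anyone.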

In most real-world situations, our data more closely resembles a multimodal mixture of distributions than a single normal distribution. What this leaves us with is very few cases in the real world that ever get to the point where we can use the normal distribution with complete faith, especially over any short period of time. Using it and its associated tools with blind loyalty can lead groups to make misguided assumptions about their own data, and lead to poor decision making. It is not “wrong” but it is also not “right”. It is simply another measure of the meaning of a specific outcome.
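A quick way to see why that matters: here is a sketch (the two customer segments and their spend figures are made up for illustration) of a bimodal mixture where the overall mean describes almost no actual visitor:

```python
import random
import statistics

random.seed(1)

# Hypothetical mixture: two distinct visitor segments with very
# different spending behavior, pooled into one metric.
casual = [random.gauss(10, 2) for _ in range(9000)]    # ~90% of traffic
whales = [random.gauss(200, 30) for _ in range(1000)]  # ~10% of traffic
spend = casual + whales

mean_spend = statistics.mean(spend)

# The pooled mean (~29) falls in the empty valley between the two
# modes: casual users cluster near 10 and whales near 200, so almost
# no individual visitor spends anywhere near the "average".
near_mean = sum(1 for x in spend if abs(x - mean_spend) < 5) / len(spend)
print(mean_spend, near_mean)
```

Acting on the single pooled mean here would optimize for a customer who does not exist.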

Even if the central limit theorem worked perfectly in real-world situations, you would still have to deal with the difference between statistical significance and practical significance. Just because something is not due to noise does not mean it answers the real question at hand. There is no magical solution that removes the need for an understanding of the discipline of testing, nor for the design of tests that answer questions instead of just picking the better of two options.
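The gap between the two kinds of significance is easy to demonstrate. In this sketch (traffic and conversion numbers invented for illustration), a lift of just 0.05 percentage points clears the conventional p < 0.05 bar once the sample is large enough, yet whether that lift matters is a business question the statistics cannot answer:

```python
import math

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value via the normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical test with enormous traffic: 4.00% vs 4.05% conversion.
# The absolute lift is tiny, but two million users per arm makes it
# "statistically significant" -- the noise floor shrank, not the effect.
p = two_sided_p(80_000, 2_000_000, 81_000, 2_000_000)
print(p)
```

A "significant" result like this says only that the difference is probably not noise; it says nothing about whether a 0.05-point lift is worth the cost of acting on it.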

So how then can we use this data?

The best answer is to understand that there is no “perfect” tool for making decisions. You are always going to need multiple measures, and some human evaluation, to improve the accuracy of a decision. A simple look at the graph, and good rules about when you look at or leverage statistical measures, can dramatically improve their value. Not just running a test because you can, and instead focusing on understanding the relative value of actions, is going to ensure you get the value you desire. Statistics is not evil, but you cannot just have blind faith in it. Each situation and each data set represents its own challenge, so the best thing you can do is focus on the key disciplines of making good decisions. These tools help inform you, but they are not meant to replace discipline and your ability to interpret the data and its context.