One problem that never seems to go away in the world of data is education and understanding around comparing data between systems. When faced with the issue, too many companies treat the variance between their different data solutions as a major sign of a problem with their reporting, when in reality variance between systems is expected. One of the hardest lessons a group can learn is to focus on the value and usage of information over the exact measure of the data. This plays out now more than ever, as more and more groups find themselves with a multitude of tools, all offering reporting and other features about their sites and their users. As more users deal with the reality of multiple reporting solutions, they discover that all the tools report different numbers, be it visits, visitors, conversion rates, or just about anything else. There can be a startling realization that there is no single measure of what you are or what you are doing, and for some groups this can strip them of their faith in their data. This variance problem is nothing new, but if it is not understood correctly, it can lead to massive internal confusion and distrust of the data.

I had to learn this lesson the hard way. I worked for a large group of websites that used six different systems for basic analytics reporting alone. I led a team to dive into the different systems, understand why they reported different things, and figure out which one was “right.” After losing months of time and almost losing complete faith in our data, we came away with some important, hard-won lessons. We learned that the use of the data is paramount, that there is no one view or right answer, that variance is almost completely predictable once you learn the systems, and that we would have been far better served spending that time on how to use the data instead of on why the systems were different.

I want to help your organization avoid the mistakes that we made. The truth is that no matter how deep you go, you will never find all the reasons for the differences. The largest lesson we learned was that an organization can get so caught up in the quest for perfect data that it forgets about the actual value of that data. To make sure you don’t get caught in this trap, I want to help you establish when and if you have a problem, cover the most common reasons for variance between systems, and offer some suggestions about how to think about and use the new data challenge that multiple reporting systems present.

Do you have a problem?

First, we must set some guidelines around when you have a variance problem and when you do not. Systems designed for different purposes leverage data in very different ways. No two systems will match, and in a lot of cases, being too close represents artificial constraints on the data that actually hinder its usability. At the same time, if the systems are too far apart, that is a sign there might be a reporting issue with one or both of the solutions.

Here are two simple questions to evaluate whether you have a variance “problem”:

1) What is the variance percentage?

Normal variance between similar data systems is almost always between 15–20%.
For non-similar data systems the range is much larger, and is usually between 35–50%.

If the gap is too small or too large, then you may have a problem. A 2% variance is actually a worse sign than a 28% variance on similar data systems.

Many groups run into the issue of trying too hard to constrain variance. The result is that they put artificial constraints on their data, severely hampering how representative the data is. Just because you believe that variance should be lower does not mean that it really should be, or that lower is always a good thing.

This analysis should be done on non-targeted groups of the same population (e.g., all users to a unique page). The variance for dependent tracking (segments) is always going to be higher.

2) Is the variance consistent in a small range?

You may see daily variance percentages like 13, 17, 20, 14, 16, 21, 12 over the course of a few days, but you should not see 5, 40, 22, 3, 78, 12. A minimal sketch of both checks follows below.
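To make the two questions concrete, here is a minimal Python sketch that runs both checks against a week of daily totals exported from two similar systems. The sample numbers, the choice to compute variance relative to the larger count, and the 15-point spread used to flag erratic behavior are all illustrative assumptions, not standards from any particular tool.

```python
# A minimal sketch of the two variance checks. All numbers below are
# illustrative assumptions, not real exports or vendor standards.

def variance_pct(a: float, b: float) -> float:
    """Percent difference between two systems, relative to the larger count."""
    return abs(a - b) / max(a, b) * 100.0

# One week of the same metric (e.g., visits) pulled from two similar systems.
system_a = [10_400, 11_050, 9_800, 10_900, 11_200, 9_950, 10_300]
system_b = [8_700, 9_100, 8_000, 9_300, 9_200, 8_100, 8_900]

daily = [variance_pct(a, b) for a, b in zip(system_a, system_b)]
print("daily variance:", [f"{v:.1f}%" for v in daily])

# Check 1: is the variance in the expected band? For similar systems that
# is roughly 15-20%; very low variance is a red flag, not a success.
if any(v < 5.0 for v in daily):
    print("Suspiciously low variance -- look for artificial constraints.")

# Check 2: is the variance consistent day to day? A spread much wider than
# ~15 points (like 5, 40, 22, 3, 78, 12) suggests a real tracking issue.
spread = max(daily) - min(daily)
if spread > 15.0:
    print("Erratic variance -- investigate one or both systems.")
else:
    print("Consistent variance -- spend your energy on using the data.")
```

Swap in the wider 35–50% band before running the same check on non-similar systems.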

If your variance is within the normal range and consistent over time, then congratulations, you are dealing with perfectly normal behavior, and I could not more strongly suggest that you spend your time and energy on how best to use the different data.

Data is only as valuable as how you use it, and while we love the idea of one perfect measure of the online world, we have to remember that each system is designed for a purpose, and that making one universal system comes with the cost of losing specialized function and value.

Always keep in mind these two questions when it comes to your data:

1) Do I feel confident that my data accurately reflects my users’ digital behavior?

2) Do I feel that things are tracked in a consistent and actionable fashion?

If you can’t answer both of those questions with a yes, then variance is not your issue. Variance is the measure of the differences between systems; if you are not confident in a single system, then there is no point in comparing it to anything. Equally, if you are comfortable with both systems, then the differences between them should mean very little.

The most important thing I can suggest is that you pick a single data system as the system of record for each action you take. Every system is designed for a different purpose, and with that purpose in mind, each one has advantages and disadvantages. You can definitely look at each system for similar items, but when it comes time to act or report, you need to be consistent and have all concerned parties aligned on which system everyone looks at. Choosing how and why you are going to act before you get to that part of the process is the easiest and fastest way to ensure the reduction of organizational barriers. Getting this agreement is far more important going forward than diving into the causes behind normal variance.

Why do systems always have variance?

For those of you who are still not completely sold, or who need at least some quick answers for senior management, I want to make sure you are prepared.
Here are the most common reasons for variance between systems:

1) The rules of the system – Visit-based systems track things very differently than visitor-based systems, and they are meant for very different purposes. In most cases, a visit-based system is used for incremental daily counting, while a visitor-based system is designed to measure action over time. The sketch after this list illustrates how the same click stream yields different counts under each model.

2) Cookies – Each system has different rules about tracking and storing cookie information over time. These rules dramatically impact what is or is not tracked. This is even more true for first-party versus third-party cookie solutions.

3) Rules of inclusion vs. rules of exclusion – For the most part, all analytics solutions follow rules of exclusion, meaning that you really have to do something (IP filter, data scrubbing, etc.) to not be tracked. A lot of other systems, especially testing tools, follow rules of inclusion, meaning you have to meet very specific criteria to be tracked. This dramatically impacts the populations, and with them any metrics tracked from those populations.

4) Definitions – What something means can be very specific to a system, be it a conversion, a segment, a referrer, or even a site action; the very definition can differ. An example of this would be a paid keyword segment: if I land on the site and then view a second page, what is the referrer for that page? Is it the referrer of the visit or the previous page? Is it something I did on an earlier visit?

5) Mechanical variance – There are mechanical differences in how systems track things. Are you tracking the click of a button with an onclick handler, the landing on the next page, or the server request? Do you use a log file system or a beacon system? Is that a unique request or added on to the next page’s tag? Do you rely on cookies, or are all actions independent? What are the different timing mechanisms for each system? Do they collide with each other or with other site functions?
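To make reason 1 concrete, here is a toy Python sketch that counts the same day of hits the way a visit-based system and a visitor-based system might. The hit data, the 30-minute session timeout, and the sessionization rule are all illustrative assumptions; every real tool defines these slightly differently, which is exactly why the numbers diverge.

```python
# A toy illustration of reason 1: the same click stream produces different
# headline numbers in a visit-based versus a visitor-based system.
from datetime import datetime, timedelta

# (visitor_id, timestamp) pairs for one day of tracked hits
hits = [
    ("alice", datetime(2012, 5, 1, 9, 0)),
    ("alice", datetime(2012, 5, 1, 9, 10)),
    ("alice", datetime(2012, 5, 1, 14, 0)),  # hours later: a new visit
    ("bob",   datetime(2012, 5, 1, 9, 5)),
    ("bob",   datetime(2012, 5, 1, 9, 50)),  # 45-minute gap: a new visit
]

TIMEOUT = timedelta(minutes=30)  # assumed session timeout

# Visit-based counting: a hit starts a new visit when the visitor's
# previous hit is more than TIMEOUT in the past.
visits = 0
last_seen = {}  # visitor_id -> timestamp of their previous hit
for visitor, ts in sorted(hits, key=lambda hit: hit[1]):
    if visitor not in last_seen or ts - last_seen[visitor] > TIMEOUT:
        visits += 1
    last_seen[visitor] = ts

# Visitor-based counting: unique people, regardless of how often they came.
visitors = len({visitor for visitor, _ in hits})

print(f"visit-based count:   {visits}")    # 4
print(f"visitor-based count: {visitors}")  # 2
```

Run as-is, the same five hits report four visits but only two visitors; neither number is wrong, they simply answer different questions.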

Every system does things differently, and these smaller differences can build up over time, especially when combined with some of the other reasons listed above. There are hundreds of reasons beyond those listed, and the reality is that each situation is unique, the culmination of the impact of a hundred different factors. There is no way to ever get to the point where you can accurately describe with 100% certainty why you get the variance you do.

Variance is not a new issue, but it is one that can be the death of programs if not dealt with in a proactive manner. Armed with this information, I would strongly suggest that you hold conversations with your data stakeholders before you run into the questions that inevitably come. Establishing what is normal, how you will act, and a few reasons why you are dealing with the issue should help cut all of these problems off at the pass.
