As a data scientist on the Adobe Target team, John Kucera—a trained physicist with a strong mathematical background that he brings to software development—talked with me about trust issues and today’s digital marketer. From where he sits, trust comes down to practical data science, and “taking thoughts that come out of the research community and testing it against our real customer data, and from that, determining what kind of features and capabilities we should put into the product.” Think better dashboards, more streamlined—but comprehensive—reporting and, ultimately, making sure we as marketers actually compute the right numbers. Here’s a piece of that conversation:
You have some expertise in the area of statistics. Can you talk about the role that statistics play in informing marketer insights, and how they can be actionable?
That’s a big question, but definitely the right one to be asking. When it comes to making sense of massive amounts of marketing data, statistics plays a big role. Without the application of statistics, it’s extremely difficult to really understand what is “true”—especially when it comes to predicting consumer outcomes, optimizing digital marketing and so on. Simply put, the application of statistics in A/B testing or automated personalization allows us to separate real signals from noise.
“Confidence” is a term we hear a lot when we talk about testing. I think most of us get it conceptually, but getting down to brass tacks, what should marketers know about confidence in the statistical sense?
Let me tell you how we look at confidence in A/B testing and Adobe Target specifically—it’s really a few things. First, it’s important to understand something called the “p-value.” This is a term used by statisticians, and it represents the probability of what they call “falsely rejecting the null hypothesis.” That sounds kind of complicated, but it simply means that when you measure two things, such as the conversion rates of experience A and experience B, you could get slightly different results purely because of statistical fluctuations—even if the conversion rates of A and B are exactly the same. The p-value is roughly the probability that you would see a difference at least as large as the one you observed, even if you were measuring exactly the same thing.
Think about this example: if you were to flip a fair coin three times, and asked a whole bunch of other people to do the same, occasionally you would find someone who actually got three heads. Now does that mean that their coin is biased? No. It’s actually just a fluke, right? So with this “p-value” and “confidence” we’re trying to understand: what is the likelihood that an experience you are testing against a control is truly different from that control, and what is the likelihood that it is just a fluke? The null hypothesis is just the hypothesis that the experience and the control have the same conversion rate. So to compute the p-value we assume that the control experience and the test experience are identical (the null hypothesis), and then compute the probability that you would see the results that you got.
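To make that coin-flip intuition concrete, here is a small simulation sketch (my illustration, not part of the interview): it estimates how often a perfectly fair coin produces three heads in a row.

```python
import random

random.seed(0)

def three_heads():
    """Flip a fair coin three times; True if all three land heads."""
    return all(random.random() < 0.5 for _ in range(3))

trials = 100_000
flukes = sum(three_heads() for _ in range(trials))

# Analytically the chance is 0.5 ** 3 = 0.125, so roughly 1 person in 8
# gets "all heads" from a perfectly fair coin -- a fluke, not a biased coin.
print(f"share of all-heads runs: {flukes / trials:.3f}")
```

About one run in eight comes up all heads even though nothing is wrong with the coin, which is exactly the kind of fluke a p-value is meant to quantify.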
So that’s what the p-value measures, and the confidence is just 1 minus that p-value. Here’s another example: a poll looks at election results, and the pollsters say there is 95% confidence that one particular candidate will win over another. That means there is a 5% probability that their poll showed one candidate winning when in reality the two were completely tied. And, therefore, this p-value of 5%, corresponding to a confidence of 95%, is the chance that they may have made a mistake, just because they didn’t collect enough data.
Are we talking about margin of error, and that marketers just need to accept a certain amount of it?
Well, it’s actually related to the margin of error, but it’s a slightly different thing. The margin of error effectively tells you, plus or minus, how closely we can measure something. Every time we measure something in science, and definitely in marketing, we can only get it to a certain level of precision. So when you take different measurements, you get slightly different numbers. The margin of error usually reflects how much difference there is between the measurements when you repeat them several times. In short, the margin of error says how well you really know this number.
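As a sketch of how a margin of error is typically computed for a conversion rate (my example, with made-up numbers and a hypothetical `margin_of_error` helper; this is the standard normal approximation, not anything specific to Adobe Target):

```python
import math

def margin_of_error(conversions, visitors, z=1.96):
    """95% margin of error for a measured conversion rate,
    using the normal approximation to the binomial."""
    p = conversions / visitors
    se = math.sqrt(p * (1 - p) / visitors)  # standard error of the rate
    return z * se

# 500 conversions out of 10,000 visitors: a measured rate of 5.0%,
# known only to within roughly +/- 0.43 percentage points.
moe = margin_of_error(500, 10_000)
print(f"conversion rate: 5.0% +/- {moe:.2%}")
```

Collect more visitors and the margin shrinks with the square root of the sample size, which is why "how well do you really know this number" is fundamentally a question of how much data you took.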
Confidence shouldn’t be confused with the confidence interval—the confidence interval is more like this margin of error. Confidence is more the probability that two different experiences are different. You could roughly think of it as the likelihood that these experiences are different, but that’s a very rough way of looking at it.
So at a very basic level: a marketer has a hypothesis, tests that hypothesis and assumes that the higher the confidence level, the better the test. There’s my indicator that I have a clear winner. Is that true or are there nuances to that?
It’s mostly true. It’s really how people think about confidence, and it seems perfectly logical to think that way. But, once again, think p-value. A p-value of 5% is really good because it means you have only a 5% chance of incorrectly concluding that A and B are different when they are actually the same. In other words, you have 95% confidence. That is usually considered to be—in most science and medical tests—reasonable proof that the two things are different. But what we find is that people will often look at confidence numbers of 75% and 80% and 90% and feel that those confidence levels are good enough. In my opinion you should have at least a 95% confidence value before you even have decent evidence that the two experiences might be different in an A/B test.
If you think about it, it’s basically saying you have 5% chance of making a mistake. And if you were to, for example, look at your confidence numbers several days in a row, then each day you would have this same chance of making a mistake, and it kind of quickly adds up. So often, people use even higher confidence levels such as 99% or 99.9%.
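The danger of checking the numbers day after day can be shown with a simulation (a hypothetical sketch, not Adobe Target’s actual methodology): two identical experiences are compared with a two-proportion z-test once per simulated day, and we count how often a daily peek would have declared 95% confidence at least once, versus looking only at the end.

```python
import math
import random

random.seed(1)

def confident_95(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: True if the gap clears 95% confidence."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > 1.96

RATE = 0.05            # both experiences truly convert at 5%: no real winner
DAILY, DAYS = 200, 14  # visitors per experience per day, test length
EXPERIMENTS = 500

peeked = final = 0
for _ in range(EXPERIMENTS):
    ca = cb = na = nb = 0
    saw_confidence = False
    for _ in range(DAYS):
        na += DAILY
        nb += DAILY
        ca += sum(random.random() < RATE for _ in range(DAILY))
        cb += sum(random.random() < RATE for _ in range(DAILY))
        if confident_95(ca, na, cb, nb):
            saw_confidence = True  # a daily peek would have called a winner
    peeked += saw_confidence
    final += confident_95(ca, na, cb, nb)

print(f"false positives when peeking daily: {peeked / EXPERIMENTS:.1%}")
print(f"false positives looking only once:  {final / EXPERIMENTS:.1%}")
```

Looking only once keeps the false-positive rate near the advertised 5%; peeking every day gives fourteen chances to be fooled, and the rate climbs well above it.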
In that vein, data science has been getting a lot of attention recently—specifically where machine learning algorithms and statistics meet marketing. That says to me that marketers are starting to put more trust in the math. So why now, and why haven’t they to this point?
It’s tough. Everyone is familiar with the notion that you can ‘lie’ with statistics. Certainly numbers can be massaged—so to speak—to tell the story you want to tell or to support a desired foregone conclusion. Of course, as a marketer you have to be very careful about what conclusions you draw from testing. You need to be confident that you can stand behind those results, since there are bottom-line consequences involved, not to mention your credibility. There is a story I like to tell. Often I see ads that say something like 95% of all milk producers in the state of California are small farmers. I’m not sure it’s exactly 95%, but the milk industry in California advertises some very high percentage. What they do in that case is count by the number of producers, not by the amount of milk being produced. It could be that the top 5% of milk producers are huge corporations that account for almost all of the milk sold in California. But by saying that 95% of producers are small farms, it makes you feel like, “Oh yeah, all of this milk is coming from mom-and-pop operations,” when in reality it’s not. So that’s where you can get lost in statistics if you don’t carefully understand what’s being measured.
So what about the future of statistics and how it relates to marketing? What do you see as the untapped frontiers?
Well, we currently use statistical techniques based on what I would call classical A/B testing. It was developed to let you measure and make decisions quickly in relatively low-computing-power environments. For example, most medical tests are still done using these techniques—the same techniques that we use in Adobe Target now. It’s called the frequentist approach, and it essentially assumes that you can repeat an experiment over and over again and get some idea of how much the results would differ. We’re looking at moving toward interpretation in terms of what’s called Bayesian statistics, where you incorporate a certain prior belief about how likely something is before you make a measurement, and the measurement then updates that prior belief into something closer to what reality is. So I think that Bayesian hypothesis testing is an area that is definitely up and coming, and on the cutting edge of things right now.
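As a sketch of what that Bayesian updating looks like in practice (my illustration with made-up numbers, not Adobe Target’s implementation): each experience’s unknown conversion rate gets a Beta prior, the observed data updates it to a Beta posterior, and sampling from the two posteriors estimates the probability that B’s true rate really beats A’s.

```python
import random

random.seed(42)

# Hypothetical observed data for illustration only.
a_conv, a_vis = 120, 2400   # experience A: 5.00% observed
b_conv, b_vis = 150, 2400   # experience B: 6.25% observed

def posterior_sample(conversions, visitors):
    # Under a flat Beta(1, 1) prior ("no prior belief"), the posterior
    # for a conversion rate is Beta(1 + successes, 1 + failures).
    return random.betavariate(1 + conversions, 1 + visitors - conversions)

draws = 50_000
b_wins = sum(posterior_sample(b_conv, b_vis) > posterior_sample(a_conv, a_vis)
             for _ in range(draws))
print(f"P(B's true rate > A's true rate): {b_wins / draws:.1%}")
```

The appeal is the direct interpretation: instead of a p-value about a hypothetical repeated experiment, you get a statement like “there is roughly a 97% chance B is truly better,” and a stronger prior belief would simply start the Beta parameters somewhere other than (1, 1).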
Final thoughts: I’m a marketer, just getting my feet wet, specifically around analytics and A/B testing. What sort of general advice do you have for me?
My main advice to current users of Adobe Target is to be sure to define the length of an A/B test ahead of time. Ideally it should run for multiple whole weeks, primarily because there’s a weekly pattern of ups and downs in people’s behavior. Find periods of time that don’t have any major events like a holiday in the middle. Predefine your test ahead of time, and when your test is done, look at the confidence numbers. Look only once, at the end of the test! If you look at the results every day and keep track of them, what you will find is that every now and then you will have a high confidence even when there really isn’t anything there. That’s because this confidence number is a statistical quantity too: the more times you measure it, the more likely you are to see what you want.