There’s been quite a storm brewing around the best methodology to use for multivariate testing: fractional-factorial vs. full-factorial. (For a quick primer, definitions of both are included at the bottom of this post.) I have to say that some of the arguments I’ve heard border on ideological in both their passion and rigor. Are there scenarios where one methodology makes more sense than the other? Absolutely. Is it possible that one methodology is right for every scenario? No. I have my own thoughts on when each approach is applicable, but I’d first like to see if we can agree on a few statements:

1) The Internet changes every day, whether it’s based on oil prices, local and global news, or your competition switching tactics and prices.

2) What worked on your site six months ago may not be what works today, and will most likely not be what works six months from now.

3) The most successful Internet marketers are light on their feet: agile, flexible, and able to adapt quickly.

4) There is no magic formula to the Internet. We cannot say that if A & B, then C will happen each and every time.

In a perfect theoretical world where none of the statements above were true, I would run full-factorial each and every time. That way, I would get to understand which exact combination is best out of all the possible combinations of elements, and even calculate all the different levels of interaction between elements. However, we unfortunately don’t have the gift of infinite time when running tests and analyzing results.

I recently read a case study touting full-factorial and the 576 different combinations tested. It had great graphs and charts of data, but, in my opinion, there were two huge things missing:

1) How long did this test take to run? If I go by Google’s handy calculator, I would estimate it took nearly half a year.

I don’t know many companies who have the luxury of running a test longer than one month, let alone five months!

2) How did different customer segments perform? Were segments even set up and tracked? With 576 combinations to test, even setting up two coarse segments such as new visitor and return visitor would double the amount of time the test had to run. In this case, we’re now looking at closer to a year! How can any company with various acquisition points and customer behavioral segments run a test and not slice its population up to understand where the differentiation lies? Consider the customers who search on “guitar center” vs. those who click a PPC ad after searching for “les paul guitar”. Is it possible they might react differently in a test? I would say it’s quite likely.
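To put rough numbers on the time cost raised in the two points above, here is a back-of-the-envelope sketch in Python. It is not Google’s calculator: the function name is my own, and the traffic figure and the visitors needed per combination are hypothetical values chosen purely for illustration. It assumes every combination needs the same number of visitors and that traffic splits evenly across combinations and segments.

```python
import math


def estimated_test_days(combinations, visitors_per_cell, daily_visitors, segments=1):
    """Rough test length: total visitors needed divided by daily traffic.

    Assumes traffic is split evenly across every combination and segment.
    """
    total_visitors_needed = combinations * segments * visitors_per_cell
    return math.ceil(total_visitors_needed / daily_visitors)


# Hypothetical inputs for illustration only; none come from the case study.
COMBINATIONS = 576
VISITORS_PER_CELL = 250
DAILY_VISITORS = 1000

all_visitors = estimated_test_days(COMBINATIONS, VISITORS_PER_CELL, DAILY_VISITORS)
two_segments = estimated_test_days(
    COMBINATIONS, VISITORS_PER_CELL, DAILY_VISITORS, segments=2
)
print(all_visitors, two_segments)
```

With these made-up inputs the estimate is 144 days for the whole population and 288 days once two segments are tracked. In practice segments are rarely the same size, so the smaller segment would take even longer to fill.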

Does all this mean there are no cases where full-factorial might be more effective? Not at all. I have recommended running a full-factorial to clients in the past when the elements they were testing were highly graphical and seemed interdependent. Take, for example, a row of different photo categories (Abstracts, People, Close-ups, B&W, etc.) to choose from, where each category’s photo representation would be considered an element to test. That seems like the appropriate place to run a full-factorial because you may not want two pictures that look very similar to appear side-by-side. However, there are trade-offs to dedicating the time and traffic to full-factorial. You most likely have to severely limit the number of elements you will be testing at once. You may also have to forgo customer segmentation unless you are one of the few companies with the benefit of millions of visitors a day.

I think that one of our own customers actually summed it up best for me last week. John Pace, a true champion of testing and the head of optimization at Real Networks, likened fractional-factorial testing to a barometer. He’s a sailing man, so forgive me if the analogy doesn’t sync up for you. A barometer measures atmospheric pressure, but its value is not so much in the precise measurement as the notification that there is a directional change in pressure.

In much the same way, testing is supposed to give you directional feedback on what is performing and resonating best with your visitors. Testing is not a document or proof that you can use to be 100% sure of how your visitors will behave moving forward. Because of that, I question how valuable it is to spend 5 months running a single test for learnings that may no longer be applicable by the time the test has completed and the data has been pumped through analysis. Instead, why not take the winnings and learnings of your week-long fractional-factorial multivariate test and then run another test that builds off that new and improved baseline? If you can approach your testing program that way, I’m confident that you’ll find more upside in both lift and learnings in the same 5-month period.

At the end of the day, no matter which methodology you end up using, the race for conversions and revenue is not going to be won by the might of your statistics. Your strengths should be creativity, innovation, and the ability to listen and react to your customer. Employing those in testing will get you to the finish line; it’s just a matter of whether you get there sooner or later than the rest of the pack.

Definitions

Multivariate Test: A multivariate (MVT) test enables you to test multiple elements simultaneously. A multivariate test example would be to test the banner, headline, copy, and call-to-action on a landing page. The benefits of running a multivariate test are that you can test more elements at once than an A/B test, and you also get information about which elements were most significant and which alternatives produced the most lift.

Full-Factorial Design: Full-factorial design tests all of the different combinations of elements and their alternatives. For example, if you had 7 elements on a page with 2 alternatives each, a full-factorial design would test all 128 (2^7) combinations.

Fractional-Factorial Design: Fractional-factorial design tests a subset of all the different combinations of elements and their alternatives. In the same test example above, a fractional-factorial design using the Taguchi method would test 8 combinations.
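As a rough illustration of the two designs defined above, the Python sketch below enumerates the full-factorial combinations for 7 elements with 2 alternatives each and builds one 8-run orthogonal array of the kind used in Taguchi-style designs. Building the array from three base columns plus their XOR combinations is a standard construction, but the function names are my own and the column order is an assumption; a real testing tool may assign elements to columns differently.

```python
from collections import Counter
from itertools import combinations, product

N_ELEMENTS = 7  # elements on the page, each with 2 alternatives (0 or 1)


def full_factorial(n_elements, n_alternatives=2):
    """Every possible combination of alternatives across all elements."""
    return list(product(range(n_alternatives), repeat=n_elements))


def l8_orthogonal_array():
    """8 runs x 7 two-level columns built from 3 base columns and their XORs."""
    runs = []
    for a, b, c in product((0, 1), repeat=3):
        runs.append((a, b, a ^ b, c, a ^ c, b ^ c, a ^ b ^ c))
    return runs


def is_pairwise_balanced(runs):
    """True if every pair of columns shows each of the 4 pairings equally often."""
    n_cols = len(runs[0])
    for i, j in combinations(range(n_cols), 2):
        counts = Counter((run[i], run[j]) for run in runs)
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True


full = full_factorial(N_ELEMENTS)
fractional = l8_orthogonal_array()
print(len(full), len(fractional), is_pairwise_balanced(fractional))
```

Each pair of elements still sees every pairing of alternatives equally often across the 8 runs, which is what lets the main effect of each element be estimated from so few combinations. The trade-off is that interaction effects between elements are confounded with main effects in a design this small.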

7 comments
John Hunter

The advantages of designed experiments with fractional factorial designs are huge in almost all real-world situations (where you have many potentially important factors). You can read a number of George Box's papers on the power of designed experiments. He is widely seen as one of the top statisticians of the 20th century. I am a bit biased as he, my father, and Stu Hunter wrote Statistics for Experimenters together (which I also highly recommend :-). If test runs cost next to nothing, full factorial (with multiple runs of combinations) might be fine. In many situations there is a significant cost to additional runs, so you need to get information that is worth that additional cost (including time to design the tests and analyze the results). If those costs are very small, then the additional runs may be warranted even if the expected value of the additional information is small. I think it is important for people to understand why they choose a given approach. They should be able to explain why, in the situation they are in, a given strategy is the most effective. Changing how content is displayed on a web page and various other options is much different than having to change manufacturing processes.

Lily Chiu

Billy - I agree that there is a lot of damaging misinformation floating around out there about fractional-factorial analysis. In fact, the ideology injected into the debate looks a lot like our political landscape these days! But anyway, that's getting off-topic. :) In response to your Google Website Optimizer remark, I do want to clarify that while GWO does not allow you to run a fractional-factorial test design, it can still provide fractional-factorial analysis given the equivalent amount of traffic. The consequences of full-factorial (in terms of time and traffic required) come into play when you're either waiting for a specific combination to reach statistical confidence or waiting to see the interaction effects between elements reach statistical confidence. There are also organizational challenges to consider when running a lengthy full-factorial test. In the example of a 7 element x 2 alternative test, there is a substantial increase in resources required to set up, QA, implement, and analyze a 128-combination full-factorial test vs. an 8-combination fractional-factorial test. As I replied to Avinash previously though, I'd be more than happy to see this methodology debate end in an "agree to disagree" truce so we can move on to more important and valuable topics!

Lily Chiu

John - thanks for highlighting the potential for testing "paralysis". I often see the political consequences of test design severely underestimated by marketers who are just getting their feet wet with optimization. I can't stress enough how important it is to start simple with high-value and easy-to-implement tests even if everyone is clamoring to see the test that changes 15 different elements with 10 variations each. At the end of the day, we all have a short attention span. The same people who wanted the big and complicated test will have found something else to care about by the time that test reaches statistical confidence months later. On a side note, here's a post I wrote a few months ago that talks about how to avoid testing paralysis when you're just getting started: http://blogs.omniture.com/2008/05/29/how-to-make-testing-successful/

Billy Shih

Thanks for writing about this topic, Lily. I'm glad to see more discussion about this. I agree with the majority of your post, especially if you're talking about A/B split testing as a form of full factorial testing. My main problem with full factorial testing is the amount of weight given to interactions. Proper methodology requires one to do some big idea testing through split or even 1-2 factor full factorial, such as your photography example, and then do a fractional factorial multivariate test on the winning page. You are right, there are times when full factorial is useful, but the problem is that the majority of people doing testing are doing tests on multiple factors on a page and getting into situations like the one you mentioned above, requiring 150 days' worth of data. Tools like Google Website Optimizer cannot do fractional factorial testing. If you've watched any of the webinars that Google has done with GWO, you can see that the tests done in-house by Google would run much quicker with a fractional factorial tool. So while I do push hard for fractional factorial, it is not because I necessarily think there is no place for full factorial. Rather, I believe the industry is still young and most marketers don't realize that there are 2 choices, and additionally those who expound full factorial are misleading people into thinking that interactions are the end-all-be-all and that fractional factorial gives bad results. The time saved by using fractional factorial cannot be emphasized enough. Best of luck with your testing and optimization :) Billy

John Lovett

Hi Lily, I couldn’t agree more with your 4th point, which is really the summation of your previous three, in that “there is no magic [static] formula to the Internet”. Testing is a process of continuous improvement, no matter what methodology you use. What works today may not be effective tomorrow and you won’t realize this unless you’re actively looking. However, if tests are painful to implement and high confidence levels require lengthy test cycles, the thought of iterative testing is excruciating to some. In my experience, testing can inflict paralysis among marketers. Collectively, marketers need to get testing, learn from the results and take action…. After that it’s a matter of rinse, lather, repeat. Here's to a squeaky clean Web, John

Lily Chiu

Avinash - thanks for the insightful comments! Let's hope this war finds a peaceful ending soon :)

Avinash Kaushik

Arguing between full and partial factorial is akin to arguing how many angels can fit on a pin. (Apparently six : )). It is really a dumb exercise to argue about methodology, in any scenario. Sometimes full might make sense, other times partial. Hence I am usually pleased when I have a choice (in the Google Website Optimizer I have a choice to use full or switch to the Section report and see the data as if I were running a partial factorial experiment). There are so many more things that make or break a testing strategy. Are you testing ideas to fix your customer problems (or ones you are dreaming up)? Are you trying radical changes or shades of green on a button? Can you deploy tests fast or does it take nine years? Can you track multiple types of goals or just one? Is your mom happy or sad? So many more productive ways to spend our lives. Let's move to those and put silly things like full or partial behind us. It simply distracts our customers from creating a productive testing strategy. Then no one wins. Nice post Lily. -Avinash. PS: In case Omniture allows links to Google :), I love the explanation of the choice here (and especially the car example that even a lay person could understand): http://www.google.com/support/websiteoptimizer/bin/answer.py?hl=en&answer=74818