There’s been quite a storm brew­ing around the best method­ol­ogy to use for mul­ti­vari­ate test­ing: fractional-factorial vs. full-factorial. (For a quick primer, def­i­n­i­tions of both are included at the bot­tom of this post.) I have to say that some of the argu­ments I’ve heard bor­der on ide­o­log­i­cal in both their pas­sion and rigor. Are there sce­nar­ios where one method­ol­ogy makes more sense than the other? Absolutely. Is it pos­si­ble that one method­ol­ogy is right for every sce­nario? No. I have my own thoughts on when each approach is applic­a­ble, but I’d first like to see if we can agree on a few statements:

1) The Inter­net changes every day, whether it’s based on oil prices, local and global news, or your com­pe­ti­tion switch­ing tac­tics and prices.

2) What worked on your site six months ago may not be the same thing that works today, and will most likely not be what works six months from now.

3) The most suc­cess­ful Inter­net mar­keters are light on their feet — agile, flex­i­ble, and able to adapt quickly.

4) There is no magic for­mula to the Inter­net. We can­not say that if A & B, then C will hap­pen each and every time.

In a per­fect the­o­ret­i­cal world where none of the state­ments above were true, I would run full-factorial each and every time. That way, I get to under­stand which exact com­bi­na­tion is best out of all the pos­si­ble com­bi­na­tions of ele­ments, and even cal­cu­late all the dif­fer­ent lev­els of inter­ac­tion between ele­ments. How­ever, we unfor­tu­nately don’t have the gift of infi­nite time when run­ning tests and ana­lyz­ing results.

I recently read a case study tout­ing full-factorial and the 576 dif­fer­ent com­bi­na­tions tested. It had great graphs and charts of data, but, in my opin­ion, there were 2 huge things missing:

1) How long did this test take to run? Going by Google’s handy calculator, I would estimate it took nearly half a year.

I don’t know many companies that have the luxury of running a test longer than one month, let alone five months!

2) How did dif­fer­ent cus­tomer seg­ments per­form? Were seg­ments even set up and tracked? With 576 com­bi­na­tions to test, even set­ting up two coarse seg­ments such as new vis­i­tor and return vis­i­tor would dou­ble the amount of time the test had to run. In this case, we’re now look­ing at closer to a year! How can any com­pany with var­i­ous acqui­si­tion points and cus­tomer behav­ioral seg­ments run a test and not slice their pop­u­la­tion up to under­stand where the dif­fer­en­ti­a­tion lies though? Con­sider the cus­tomers who search on “gui­tar cen­ter” vs. those who click a PPC ad after search­ing for “les paul gui­tar” — is it pos­si­ble they might react dif­fer­ently in a test? I would say it’s quite likely.
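
The arithmetic behind these two questions can be sketched roughly. The snippet below is a back-of-the-envelope estimate, not Google’s calculator: the function name, the 2,500 visitors assumed per combination, and the 10,000 test visitors per day are all hypothetical figures chosen for illustration. Only the 576 combinations come from the case study.

```python
import math

def estimated_test_days(combinations, visitors_per_combination,
                        daily_visitors, segments=1):
    """Rough number of days until every combination in every segment
    has received its target sample.

    Assumes traffic splits evenly across combinations and segments,
    which is a simplification: real segments are rarely equal in size.
    """
    total_visitors_needed = combinations * visitors_per_combination * segments
    return math.ceil(total_visitors_needed / daily_visitors)

# Hypothetical inputs (assumptions, not figures from the case study):
# 2,500 visitors per combination, 10,000 test visitors per day.
full = estimated_test_days(576, 2500, 10000)
segmented = estimated_test_days(576, 2500, 10000, segments=2)
print(full, segmented)
```

Under these assumed inputs the unsegmented test needs about 144 days, and splitting into two equal segments doubles that to about 288 days. Different traffic or sample-size assumptions will move the numbers, but the doubling per added segment holds as long as each segment must reach the same sample on its own.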

Does all this mean there are no cases where full-factorial might be more effec­tive? Not at all. I have rec­om­mended run­ning a full-factorial to clients in the past when the ele­ments they were test­ing were highly graph­i­cal and seemed inter­de­pen­dent. Take, for exam­ple, a row of dif­fer­ent photo cat­e­gories (Abstracts, Peo­ple, Close ups, B&W, etc) to choose from where each category’s photo rep­re­sen­ta­tion would be con­sid­ered an ele­ment to test. That seems like the appro­pri­ate place to run a full-factorial because you may not want 2 pic­tures that look very sim­i­lar to appear side-by-side. How­ever, there are trade-offs to ded­i­cat­ing the time and traf­fic to full-factorial. You most likely have to severely limit the num­ber of ele­ments you will be test­ing at once. You may also have to forgo cus­tomer seg­men­ta­tion unless you are one of the few com­pa­nies with the ben­e­fit of mil­lions of vis­i­tors a day.

I think that one of our own cus­tomers actu­ally summed it up best for me last week. John Pace, a true cham­pion of test­ing and the head of opti­miza­tion at Real Net­works, likened fractional-factorial test­ing to a barom­e­ter. He’s a sail­ing man, so for­give me if the anal­ogy doesn’t sync up for you. A barom­e­ter mea­sures atmos­pheric pres­sure, but its value is not so much in the pre­cise mea­sure­ment as the noti­fi­ca­tion that there is a direc­tional change in pressure.

In much the same way, testing is supposed to give you directional feedback on what is performing and resonating best with your visitors. Testing is not a document or proof that you can use to be 100% sure of how your visitors will behave moving forward. Because of that, I question how valuable it is to spend 5 months running a single test for learnings that may no longer be applicable by the time the test has completed and the data has been analyzed. Instead, why not take the winnings and learnings of your week-long fractional-factorial multivariate test and then run another test that builds off that new and improved baseline? If you can approach your testing program that way, I’m confident that you’ll find more upside in both lift and learnings in the same 5-month period.

At the end of the day, no mat­ter which method­ol­ogy you end up using, the race for con­ver­sions and rev­enue is not going to be won by the might of your sta­tis­tics. Your strengths should be cre­ativ­ity, inno­va­tion, and the abil­ity to lis­ten and react to your cus­tomer. Employ­ing those in test­ing will get you to the fin­ish line; it’s just a mat­ter of whether you get there sooner or later than the rest of the pack.

Def­i­n­i­tions

Mul­ti­vari­ate Test: A mul­ti­vari­ate (MVT) test enables you to test mul­ti­ple ele­ments simul­ta­ne­ously. A mul­ti­vari­ate test exam­ple would be to test the ban­ner, head­line, copy, and call-to-action on a land­ing page. The ben­e­fits of run­ning a mul­ti­vari­ate test are that you can test more ele­ments at once than an A/B test, and you also get infor­ma­tion about which ele­ments were most sig­nif­i­cant and which alter­na­tives pro­duced the most lift.

Full-Factorial Design: Full-factorial design tests all of the different combinations of elements and their alternatives. For example, if you had 7 elements on a page with 2 alternatives each, a full-factorial design would test all 128 (2^7) combinations.

Fractional-Factorial Design: Fractional-factorial design tests a subset of all the different combinations of elements and their alternatives. In the same test example above, a fractional-factorial design using the Taguchi method would test 8 combinations.
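
To make the 128 vs. 8 comparison concrete, here is a minimal sketch of both designs for the 7-element, 2-alternative example. It is an illustration under stated assumptions, not how any particular testing tool builds its designs: the function names are hypothetical, and the 8-run array is built from three base columns plus their XOR combinations, which gives an orthogonal array of the same kind as Taguchi’s L8 (the row and column order may differ from published tables).

```python
from itertools import combinations, product

# Full factorial: every combination of 7 elements with 2 alternatives each.
full_factorial = list(product([0, 1], repeat=7))

def l8_design():
    """Build an 8-run, 7-column two-level orthogonal array.

    Three base columns (a, b, c) cover all 8 of their combinations;
    the remaining four columns are XORs of the base columns.
    """
    runs = []
    for a, b, c in product([0, 1], repeat=3):
        runs.append((a, b, a ^ b, c, a ^ c, b ^ c, a ^ b ^ c))
    return runs

def is_orthogonal(design):
    """True if every pair of columns shows each of the four
    (alternative, alternative) pairings equally often."""
    n_cols = len(design[0])
    for i, j in combinations(range(n_cols), 2):
        counts = {}
        for run in design:
            key = (run[i], run[j])
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True

fractional = l8_design()
print(len(full_factorial), len(fractional), is_orthogonal(fractional))
```

Each row of `fractional` is one page version to serve, with 0 and 1 standing for the two alternatives of each element. The balance check is what lets 8 runs estimate the effect of each of the 7 elements. The trade-off is that a design this small cannot separate element effects from interactions between elements, which is the information full-factorial spends its extra combinations to get.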

  • http://www.kaushik.net/avinash Avinash Kaushik

    Arguing between full and partial factorial is akin to arguing how many angels can fit on a pin. (Apparently six : )).

    It is really a dumb exer­cise to argue about method­ol­ogy, in any sce­nario. Some­times full might make sense, other times par­tial. Hence I am usu­ally pleased when I have a choice (in the Google Web­site Opti­mizer I have a choice to use full or switch to the Sec­tion report and see the data as if I were run­ning a par­tial fac­to­r­ial experiment).

    There are so many more things that make or break a test­ing strategy.

    Are you testing ideas to fix your customer problems (or ones you are dreaming up)? Are you trying radical changes or shades of green on a button? Can you deploy tests fast, or does it take nine years? Can you track multiple types of goals or just one? Is your mom happy or sad?

    So many more productive ways to spend our lives. Let’s move to those and put silly things like full or partial behind us. It simply distracts our customers from creating a productive testing strategy. Then no one wins.

    Nice post Lily.

    –Avinash.

    PS: In case Omni­ture allows links to Google :) , I love the expla­na­tion of the choice here (and espe­cially the car exam­ple that even a lay per­son could understand):

    http://www.google.com/support/websiteoptimizer/bin/answer.py?hl=en&answer=74818

  • http://blogs.omniture.com/author/lchiu/ Lily Chiu

    Avinash — thanks for the insight­ful com­ments! Let’s hope this war finds a peace­ful end­ing soon :)

  • http://www.analyticsevolution.com John Lovett

    Hi Lily, I couldn’t agree more with your 4th point, which is really the summation of your previous three, in that “there is no magic [static] formula to the Internet”. Testing is a process of continuous improvement, no matter what methodology you use. What works today may not be effective tomorrow, and you won’t realize this unless you’re actively looking. However, if tests are painful to implement and high confidence levels require lengthy test cycles, the thought of iterative testing is excruciating to some. In my experience, testing can inflict paralysis among marketers. Collectively, marketers need to get testing, learn from the results, and take action. After that it’s a matter of lather, rinse, repeat.

    Here’s to a squeaky clean Web,
    John

  • http://testingblog.widemile.com Billy Shih

    Thanks for writ­ing about this topic, Lily. I’m glad to see more dis­cus­sion about this.

    I agree with the major­ity of your post, espe­cially if you’re talk­ing about a/b split test­ing as a form of full fac­to­r­ial testing.

    My main problem with full factorial testing is the amount of weight given to interactions. Proper methodology requires one to do some big idea testing through split or even 1–2 factor full factorial, such as your photography example, and then do a fractional factorial multivariate test on the winning page.

    You are right, there are times when full fac­to­r­ial is use­ful, but the prob­lem is that the major­ity of peo­ple doing test­ing are doing tests on mul­ti­ple fac­tors on a page and get­ting into sit­u­a­tions like the one you men­tioned above, requir­ing 150 days worth of data.

    Tools like Google Website Optimizer cannot do fractional factorial testing. If you’ve watched any of the webinars that Google has done with GWO, you can see that the tests Google runs in-house would be done much quicker with a fractional factorial tool.

    So while I do push hard for fractional factorial, it is not because I necessarily think there is no place for full factorial. Rather, I believe the industry is still young and most marketers don’t realize that there are 2 choices. Additionally, those who espouse full factorial are misleading people into thinking that interactions are the end-all-be-all and that fractional factorial gives bad results. The time saved by using fractional factorial cannot be emphasized enough.

    Best of luck with your test­ing and opti­miza­tion :)
    Billy

  • http://blogs.omniture.com/author/lchiu Lily Chiu

    John — thanks for high­light­ing the poten­tial for test­ing “paral­y­sis”. I often see the polit­i­cal con­se­quences of test design severely under­es­ti­mated by mar­keters who are just get­ting their feet wet with opti­miza­tion. I can’t stress how impor­tant it is to start sim­ple with high-value and easy-to-implement tests even if every­one is clam­or­ing to see the test that changes 15 dif­fer­ent ele­ments with 10 vari­a­tions each. At the end of the day, we all have a short atten­tion span. The same peo­ple who wanted the big and com­pli­cated test will have found some­thing else to care about by the time that test reaches sta­tis­ti­cal con­fi­dence months later.

    On a side note, here’s a post I wrote a few months ago that talks about how to avoid test­ing paral­y­sis when you’re just get­ting started:
    http://blogs.omniture.com/2008/05/29/how-to-make-testing-successful/

  • http://blogs.omniture.com/author/lchiu Lily Chiu

    Billy — I agree that there is a lot of dam­ag­ing mis­in­for­ma­tion float­ing around out there about fractional-factorial analy­sis. In fact, the ide­ol­ogy injected into the debate looks a lot like our polit­i­cal land­scape these days! But any­way, that’s get­ting off-topic. :)

    In response to your Google Web­site Opti­mizer remark, I do want to clar­ify that while GWO does not allow you to run a fractional-factorial test design, it can still pro­vide fractional-factorial analy­sis given the equiv­a­lent amount of traf­fic. The con­se­quences of full-factorial (in terms of time and traf­fic required) come into play when you’re either wait­ing for a spe­cific com­bi­na­tion to reach sta­tis­ti­cal con­fi­dence or wait­ing to see the inter­ac­tion effects between ele­ments reach sta­tis­ti­cal confidence.

    There are also orga­ni­za­tional chal­lenges to con­sider when run­ning a lengthy full-factorial test. In the exam­ple of a 7 ele­ment x 2 alter­na­tive test, there is a sub­stan­tial increase in resources required to set up, QA, imple­ment, and ana­lyze a 128-combination full-factorial test vs. an 8-combination fractional-factorial test.

    As I replied to Avinash pre­vi­ously though, I’d be more than happy to see this method­ol­ogy debate end in an “agree to dis­agree” truce so we can move on to more impor­tant and valu­able topics!

  • http://johnhunter.com/ John Hunter

    The advantages of designed experiments with fractional factorial designs are huge in almost all real-world situations (where you have many potentially important factors). You can read a number of George Box’s papers on the power of designed experiments. He is widely seen as one of the top statisticians of the 20th century. I am a bit biased, as he, my father, and Stu Hunter wrote Statistics for Experimenters together (which I also highly recommend :-) ).

    If test runs cost next to nothing, full factorial (with multiple runs of combinations) might be fine. In many situations there is a significant cost to additional runs, so you need to get information that is worth that additional cost (including time to design the tests and analyze the results). If those costs are very small, then they may be warranted even if the expected value of the additional information is small.

    I think it is important for people to understand why they choose a given approach. They should be able to explain why, in the situation they are in, it is the most effective strategy.

    Chang­ing how con­tent is dis­played on a web page and var­i­ous other options is much dif­fer­ent than hav­ing to change man­u­fac­tur­ing processes.