There’s been quite a storm brewing around the best methodology to use for multivariate testing: fractional-factorial vs. full-factorial. (For a quick primer, definitions of both are included at the bottom of this post.) I have to say that some of the arguments I’ve heard border on ideological in both their passion and rigor. Are there scenarios where one methodology makes more sense than the other? Absolutely. Is it possible that one methodology is right for every scenario? No. I have my own thoughts on when each approach is applicable, but I’d first like to see if we can agree on a few statements:

1) The Internet changes every day, whether it’s based on oil prices, local and global news, or your competition switching tactics and prices.

2) What worked on your site six months ago may not be the same thing that works today, and will most likely not be what works six months from now.

3) The most successful Internet marketers are light on their feet – agile, flexible, and able to adapt quickly.

4) There is no magic formula to the Internet. We cannot say that if A & B, then C will happen each and every time.

In a perfect theoretical world where none of the statements above were true, I would run full-factorial each and every time. That way, I get to understand which exact combination is best out of all the possible combinations of elements, and even calculate all the different levels of interaction between elements. However, we unfortunately don’t have the gift of infinite time when running tests and analyzing results.

I recently read a case study touting full-factorial and the 576 different combinations tested. It had great graphs and charts of data, but, in my opinion, there were 2 huge things missing:

1) How long did this test take to run? If I go by Google’s handy calculator, I would estimate it took nearly half a year.

I don’t know many companies that have the luxury of running a test longer than one month, let alone five months!

2) How did different customer segments perform? Were segments even set up and tracked? With 576 combinations to test, even setting up two coarse segments such as new visitor and return visitor would double the amount of time the test had to run. In this case, we’re now looking at closer to a year! Yet how can any company with various acquisition points and customer behavioral segments run a test without slicing its population up to understand where the differentiation lies? Consider the customers who search on “guitar center” vs. those who click a PPC ad after searching for “les paul guitar” – is it possible they might react differently in a test? I would say it’s quite likely.
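To make the arithmetic behind both questions concrete, here is a rough back-of-the-envelope sketch. It is not the formula Google's calculator uses. The function name, the 1,000 visitors needed per combination, and the 3,200 visitors per day are all hypothetical numbers, chosen only so the baseline lands near half a year.

```python
def estimated_days(combinations, visitors_per_combination, daily_visitors, segments=1):
    """Rough test duration: every combination, in every segment,
    needs its own share of traffic before it can be read.

    All inputs are illustrative assumptions, not figures from the case study.
    """
    total_visitors_needed = combinations * segments * visitors_per_combination
    return total_visitors_needed / daily_visitors


# Hypothetical traffic figures (assumptions for illustration only).
VISITORS_PER_COMBINATION = 1000
DAILY_VISITORS = 3200

# 576 combinations, no segmentation.
unsegmented = estimated_days(576, VISITORS_PER_COMBINATION, DAILY_VISITORS)

# Same test split into two coarse segments (new vs. return visitors).
two_segments = estimated_days(576, VISITORS_PER_COMBINATION, DAILY_VISITORS, segments=2)

print(unsegmented)   # 180.0 days, roughly half a year
print(two_segments)  # 360.0 days, roughly a full year
```

Under these assumptions, the required traffic scales linearly with both the number of combinations and the number of segments, which is why each added segment is so expensive in a full-factorial test.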

Does all this mean there are no cases where full-factorial might be more effective? Not at all. I have recommended running a full-factorial to clients in the past when the elements they were testing were highly graphical and seemed interdependent. Take, for example, a row of different photo categories (Abstracts, People, Close-ups, B&W, etc.) to choose from, where each category’s photo representation would be considered an element to test. That seems like the appropriate place to run a full-factorial because you may not want two pictures that look very similar to appear side-by-side. However, there are trade-offs to dedicating the time and traffic to full-factorial. You most likely have to severely limit the number of elements you will be testing at once. You may also have to forgo customer segmentation unless you are one of the few companies with the benefit of millions of visitors a day.

I think that one of our own customers actually summed it up best for me last week. John Pace, a true champion of testing and the head of optimization at Real Networks, likened fractional-factorial testing to a barometer. He’s a sailing man, so forgive me if the analogy doesn’t sync up for you. A barometer measures atmospheric pressure, but its value is not so much in the precise measurement as the notification that there is a directional change in pressure.

In much the same way, testing is supposed to give you directional feedback on what is performing and resonating best with your visitors. Testing is not a document or proof that you can use to be 100% sure of how your visitors will behave moving forward. Because of that, I question how valuable it is to spend 5 months running one single test for learnings that may no longer be applicable by the time the test has completed and the data has been pumped through analysis. Instead, why not take the winnings and learnings of your week-long fractional-factorial multivariate test and then run another test that builds off that new and improved baseline? If you can approach your testing program that way, I’m confident that you’ll find more upside in both lift and learnings in the same 5-month period.

At the end of the day, no matter which methodology you end up using, the race for conversions and revenue is not going to be won by the might of your statistics. Your strengths should be creativity, innovation, and the ability to listen and react to your customer. Employing those in testing will get you to the finish line; it’s just a matter of whether you get there sooner or later than the rest of the pack.

Definitions

Multivariate Test: A multivariate (MVT) test enables you to test multiple elements simultaneously. A multivariate test example would be to test the banner, headline, copy, and call-to-action on a landing page. The benefits of running a multivariate test are that you can test more elements at once than an A/B test, and you also get information about which elements were most significant and which alternatives produced the most lift.

Full-Factorial Design: Full-factorial design tests all of the different combinations of elements and their alternatives. For example, if you had 7 elements on a page with 2 alternatives each, a full-factorial design would test all 128 (2^7) combinations.

Fractional-Factorial Design: Fractional-factorial design tests a subset of all the different combinations of elements and their alternatives. In the same test example above, a fractional-factorial design using the Taguchi method would test 8 combinations.
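The 128 vs. 8 comparison in the two definitions above can be sketched in a few lines of Python. The fractional design here is one standard construction of an 8-run orthogonal array for 7 two-level elements (three base columns plus their XOR combinations); the post does not say which array a given tool uses, so treat this as an illustration under that assumption. The function names are hypothetical.

```python
from itertools import combinations, product


def full_factorial(n_elements, n_alternatives=2):
    """Every combination of alternatives across all elements."""
    return list(product(range(n_alternatives), repeat=n_elements))


def eight_run_fractional():
    """An 8-run orthogonal array for 7 two-level elements.

    Three base columns (a, b, c) take all 8 on/off patterns; the other
    four columns are XOR combinations of them. This is an assumed,
    textbook-style construction, not a specific vendor's design.
    """
    runs = []
    for a, b, c in product((0, 1), repeat=3):
        runs.append((a, b, a ^ b, c, a ^ c, b ^ c, a ^ b ^ c))
    return runs


def is_pairwise_balanced(runs):
    """True if every pair of elements shows each of the four
    alternative pairings equally often across the runs."""
    n_columns = len(runs[0])
    expected = len(runs) // 4
    for i, j in combinations(range(n_columns), 2):
        counts = {}
        for run in runs:
            key = (run[i], run[j])
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != 4 or any(v != expected for v in counts.values()):
            return False
    return True


full = full_factorial(7)
fractional = eight_run_fractional()

print(len(full))        # 128
print(len(fractional))  # 8
print(is_pairwise_balanced(fractional))  # True
```

The balance check is what lets 8 runs stand in for 128 when estimating each element's individual effect. The trade-off is that interactions between elements are no longer separable from those individual effects, which is the cost the full-factorial camp points to.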

John Hunter

The advantages of designed experiments with fractional factorial designs are huge in almost all real-world situations (where you have many potentially important factors). You can read a number of George Box's papers on the power of designed experiments. He is widely seen as one of the top statisticians of the 20th century. I am a bit biased as he, my father and Stu Hunter wrote Statistics for Experimenters together (which I also highly recommend :-). If test runs cost next to nothing, full factorial (with multiple runs of combinations) might be fine. In many situations there is a significant cost to additional runs, so you need to get information that is worth that additional cost (including time to design the tests and analyze the results). If those costs are very small then the extra runs may be warranted even if the expected value of the additional information is small. I think it is important for people to understand why they choose a given approach. They should be able to explain why, in the situation they are in, a given strategy is the most effective one. Changing how content is displayed on a web page and various other options is much different than having to change manufacturing processes.

Lily Chiu

Billy - I agree that there is a lot of damaging misinformation floating around out there about fractional-factorial analysis. In fact, the ideology injected into the debate looks a lot like our political landscape these days! But anyway, that's getting off-topic. :) In response to your Google Website Optimizer remark, I do want to clarify that while GWO does not allow you to run a fractional-factorial test design, it can still provide fractional-factorial analysis given the equivalent amount of traffic. The consequences of full-factorial (in terms of time and traffic required) come into play when you're either waiting for a specific combination to reach statistical confidence or waiting to see the interaction effects between elements reach statistical confidence. There are also organizational challenges to consider when running a lengthy full-factorial test. In the example of a 7 element x 2 alternative test, there is a substantial increase in resources required to set up, QA, implement, and analyze a 128-combination full-factorial test vs. an 8-combination fractional-factorial test. As I replied to Avinash previously though, I'd be more than happy to see this methodology debate end in an "agree to disagree" truce so we can move on to more important and valuable topics!

Lily Chiu

John - thanks for highlighting the potential for testing "paralysis". I often see the political consequences of test design severely underestimated by marketers who are just getting their feet wet with optimization. I can't stress how important it is to start simple with high-value and easy-to-implement tests even if everyone is clamoring to see the test that changes 15 different elements with 10 variations each. At the end of the day, we all have a short attention span. The same people who wanted the big and complicated test will have found something else to care about by the time that test reaches statistical confidence months later. On a side note, here's a post I wrote a few months ago that talks about how to avoid testing paralysis when you're just getting started: http://blogs.omniture.com/2008/05/29/how-to-make-testing-successful/

Billy Shih

John Lovett

Hi Lily, I couldn’t agree more with your 4th point, which is really the summation of your previous three, in that “there is no magic [static] formula to the Internet”. Testing is a process of continuous improvement, no matter what methodology you use. What works today may not be effective tomorrow and you won’t realize this unless you’re actively looking. However, if tests are painful to implement and high confidence intervals require lengthy test cycles, the thought of iterative testing is excruciating to some. In my experience, testing can inflict paralysis among marketers. Collectively, marketers need to get testing, learn from the results and take action. After that it’s a matter of rinse, lather, repeat. Here's to a squeaky clean Web, John

Lily Chiu

Avinash - thanks for the insightful comments! Let's hope this war finds a peaceful ending soon :)

Avinash Kaushik