Multivariate Test Results: Confidence, Stability & Determining a Winner


Designing your MVT tests, developing your creative and implementing code are all just precursors to getting your test live. However, once your tests are live, success will be determined by how well you did all those things. At this point you are just monitoring results, so relax and kick your feet up… if only it were that easy.

Over three years of doing MVT, most of the questions I’ve received have been about determining when a test is over. This is because when a test is live, results are always going to be changing. But testing forever defeats the purpose. Every test needs a time to cut the cord.

So, how do you figure out when you’re finished? You need to weigh the following factors:

Confidence: Confidence is a measure of how sure you can be that the difference between recipes is real and not random noise. For the same amount of traffic, the bigger the gap in performance between one test recipe and another, the more statistical confidence will be achieved. For example, if our success metric is conversion rate and recipe A is at 5.0% while recipe B is at 5.5%, there will not be a lot of confidence (that B is 10% better). However, if A were at 5.0% and B at 10.0%, there would be a great deal of confidence (that B is 100% better than A).

Important Point: The confidence metric is based on the data that has been collected. This is not a predictive calculation.
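To make the 5.0% vs. 5.5% and 5.0% vs. 10.0% comparison concrete, here is a minimal sketch using a pooled two-proportion z-test. This is one common way to compute confidence and is not necessarily what Test&Target does internally. The function name and the traffic of 1,000 visitors per recipe are assumptions made for illustration.

```python
import math


def confidence_b_beats_a(conv_a, n_a, conv_b, n_b):
    """One-sided confidence that recipe B's true rate is higher than A's.

    Uses a pooled two-proportion z-test (an assumed method, for illustration).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.5  # no variation in the data, so no evidence either way
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF written with the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical traffic: 1,000 visitors per recipe.
small_gap = confidence_b_beats_a(50, 1000, 55, 1000)   # 5.0% vs 5.5%
large_gap = confidence_b_beats_a(50, 1000, 100, 1000)  # 5.0% vs 10.0%
print(f"5.0% vs 5.5%:  {small_gap:.1%} confidence")
print(f"5.0% vs 10.0%: {large_gap:.1%} confidence")
```

With the same traffic, the wide gap produces far more confidence than the narrow one. As noted above, the number only describes the data collected so far.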

Margin of Error: MOE takes the confidence stats and factors in the amount of data that has been collected. The more data, the narrower the MOE. I generally don’t pay much attention to MOE, as these ranges can be very wide. I know some stat heads might get their panties in a bunch about this, but as a marketer who relies on speed, this can be a paralyzing metric, since so much data is needed in most cases, even with fractional factorial testing.
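A rough sketch of why MOE demands so much data, using the standard normal-approximation formula for a proportion at 95% confidence. The 5% conversion rate and the visitor counts are assumed values, and a given testing tool may calculate MOE differently.

```python
import math


def margin_of_error(rate, visitors, z=1.96):
    """Normal-approximation margin of error for a conversion rate.

    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(rate * (1 - rate) / visitors)


# Hypothetical recipe converting at 5% with growing amounts of traffic.
for visitors in (1_000, 10_000, 100_000):
    moe = margin_of_error(0.05, visitors)
    print(f"{visitors:>7} visitors: 5.0% plus or minus {moe:.2%}")
```

Because the margin shrinks with the square root of the traffic, cutting it in half takes about four times as much data.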

Stability: Stability and confidence are the two most important things to look at in determining if your test is over. There are two graphs you want to look at to judge test stability. One is cumulative results and the other is daily results. Let’s see what these reports look like in the Omniture Test&Target tool.

Cumulative
[Screenshot: cumulative results report]

The main things we’re looking for in the cumulative reports are trending and consistency. Once things seem to level off for a period of a week or so, we’re looking good.
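If you export the daily numbers, the "leveled off for a week" check can be approximated in a few lines. This is only a sketch: the function names, the 7-day window, the 5% tolerance and the sample data are all assumptions, not settings from any tool.

```python
from itertools import accumulate


def cumulative_rates(visitors, conversions):
    """Running conversion rate after each day (assumes day 1 has visitors)."""
    total_visitors = accumulate(visitors)
    total_conversions = accumulate(conversions)
    return [c / v for c, v in zip(total_conversions, total_visitors)]


def has_leveled_off(rates, window=7, tolerance=0.05):
    """True if the last `window` cumulative rates stay within `tolerance`
    (relative to their average) of each other."""
    if len(rates) < window:
        return False
    recent = rates[-window:]
    average = sum(recent) / window
    return (max(recent) - min(recent)) <= tolerance * average


# Hypothetical recipe: 1,000 visitors a day for two weeks, noisy early on.
visitors = [1000] * 14
conversions = [80, 30, 65, 40, 55, 48, 52, 50, 51, 49, 50, 52, 48, 50]
rates = cumulative_rates(visitors, conversions)

print("Stable after week 1:", has_leveled_off(rates[:7]))
print("Stable after week 2:", has_leveled_off(rates))
```

The tolerance is a judgment call. The point is to put a number on what your eye is doing when it reads the cumulative graph.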

Daily
[Screenshot: daily results report]

The main things we’re looking for in the daily results reports are outliers and fluctuation. Once we have a recipe that wins most of the days we’re looking good.
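The same two checks, how often a recipe wins and which days look like outliers, can be sketched against exported daily conversion rates. The function names, the "twice or half the median" outlier rule and the sample week are assumptions chosen for illustration.

```python
from statistics import median


def win_share(rates_a, rates_b):
    """Fraction of non-tied days on which recipe B beat recipe A."""
    decided = [(a, b) for a, b in zip(rates_a, rates_b) if a != b]
    if not decided:
        return 0.0
    wins_b = sum(1 for a, b in decided if b > a)
    return wins_b / len(decided)


def outlier_days(rates, factor=2.0):
    """Indexes of days whose rate is more than `factor` times the median
    or less than the median divided by `factor`."""
    typical = median(rates)
    return [
        day
        for day, rate in enumerate(rates)
        if rate > typical * factor or rate < typical / factor
    ]


# Hypothetical week of daily conversion rates for two recipes.
daily_a = [0.050, 0.048, 0.052, 0.051, 0.049, 0.050, 0.047]
daily_b = [0.055, 0.046, 0.056, 0.054, 0.120, 0.053, 0.052]

print(f"B wins on {win_share(daily_a, daily_b):.0%} of days")
print("Outlier days for B (0-based):", outlier_days(daily_b))
```

A flagged day is worth a closer look (an email blast, a tracking glitch) before you count it as a win.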

Account for Temporal Changes!

Generally a best practice is to let your multivariate tests run a minimum of two weeks. This way you can get week over week results and see if there are any strange temporal behaviors that could be skewing the results. Here it is helpful to look at the daily results. I’m hoping Omniture’s Test&Target will soon be able to graph results week over week (or in other comparative timeframes) like Google can.
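Until the tool graphs it for you, a week-over-week comparison can be done by hand from daily exports. This sketch computes recipe B's lift over recipe A for each full week. The function name and the sample numbers are hypothetical, and it assumes all four lists cover the same days.

```python
def weekly_lifts(visitors_a, conv_a, visitors_b, conv_b):
    """Relative lift of recipe B over recipe A for each full 7-day week.

    All four lists are assumed to be aligned day by day.
    """
    lifts = []
    full_weeks = len(visitors_a) // 7
    for week in range(full_weeks):
        days = slice(week * 7, (week + 1) * 7)
        rate_a = sum(conv_a[days]) / sum(visitors_a[days])
        rate_b = sum(conv_b[days]) / sum(visitors_b[days])
        lifts.append((rate_b - rate_a) / rate_a)
    return lifts


# Hypothetical two-week test with 1,000 visitors per recipe per day.
visitors = [1000] * 14
conversions_a = [50] * 14
conversions_b = [60] * 7 + [55] * 7

for week, lift in enumerate(weekly_lifts(visitors, conversions_a,
                                         visitors, conversions_b), start=1):
    print(f"Week {week}: B is {lift:+.0%} vs A")
```

If the winner flips or the lift swings wildly between weeks, that is a sign of the temporal behavior described above and a reason to keep the test running.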

Don’t look back!

Successful multivariate testing is about speed (how quickly), velocity (how many) and iteration (how intelligent) based on analytic data. I’ve never regretted stopping a test with a big winner, because even after a test is done you are going to be monitoring results. More often than not, early results hold up as winners even if the overall improvement levels subside a little bit. For best results I’d much rather run 10 small, quick tests over a one-month period than 2 large ones.

This post effectively wraps up my multivariate testing overview in six parts. My final thoughts:

Multivariate testing can be a tremendous amount of fun and get you great results, but it requires highly dedicated marketers and great creative methodology. Matt Roche, the founder of Offermatica, once shared three learnings from his time building the most successful multivariate testing tool. I’ll end with his great advice for digital marketers.

1. Great marketing comes from great marketers, machines help them aim better

2. Engaged marketers lead to engaged customers

3. Speed is everything

Happy testing!

Comments

6 responses to “Multivariate Test Results: Confidence, Stability & Determining a Winner”

  1. Yoav

    Hi Jonathan,
    I’ve been following the MVT series and it’s great. Thank you!
    Yoav


  2. Billy Shih

    Hi Jon,
    I agree with much of what you’ve said throughout this series. Thanks for sharing your knowledge with the community. The more that people know about testing and optimization, the more success everyone will find online.
    I have a question though: you mention that tests should run for 2 weeks at minimum but then say “For best results I’d much rather run 10 small, quick tests over a month period than 2 large ones.” I agree with the statement that tests should be run 2 weeks and also that many small tests are better than doing one large test, but your example is a bit contradictory. Can you define when it is reasonable to end a test in less than 2 weeks?
    -Billy


  3. Jonathan Mendez

    Thanks Billy.
    To answer your question what I’m referring to is not speed of testing (how long the tests run) but velocity (how many tests you are running concurrently). This is not a contradiction but two different methods of optimization.
    I’ll still take a swipe at your next question:
    It’s reasonable to end a test prior to 2 weeks whenever you feel you have enough signal to make a more informed and intelligent marketing decision than you would have made had you not tested.
    One of my favorite testing stories is the client who launched an A/B test on their homepage for 20 minutes one morning so they could decide what creative to use for their one-day sale.
    Knowing anything is better than only guessing.


  4. Billy Shih

    Ahh okay, I understand. Thanks for the response.
    That’s definitely a fun story. Wow a 20 minute test.


  5. Ankush

    Hi Jon,
    Your 6 part series on MVT was very informative and useful. Thanks !
    In part 1, you described the pros / cons of fractional / full factorial testing.
    Can you explain or elaborate on how much better is fractional factorial testing over full factorial for the same test LP, elements & same test period?
    Any examples would be greatly appreciated.
    Keep posting!
    Thanks
    Ankush

