The title may seem a bit controversial. A fairly common question I get from large (and small) companies is: “Should I run A/A tests to check whether my experiment is working?”
The answer might surprise you.
I’ve been doing split and multivariate tests since 2004, and have watched every single one like a hawk. I’ve personally made every mistake in the book and wasted many days, in order to get better and to keep improving my ability to run valid tests.
What does my experience tell me here?
There are better ways to use your precious testing time
It’s important to note that I don’t want to come across as saying running an A/A test is wrong – just that my experience tells me there are better ways to use your time when testing. Just as there are many ways to lose weight, there are optimal ways to run your tests.
While the volume of tests you start is important, what matters most is how many you finish every month, and how many of those you learn something useful from.
Running A/A tests can eat into ‘real’ testing time.
The trick of a large-scale optimisation programme is to reduce the resource-cost-to-opportunity ratio, protecting both your testing velocity and what you learn, by completely removing wastage, stupidity and inefficiency from the process.
Running experiments on your site is a bit like running a busy airline at a major international airport – you have limited take-off slots and you need to make sure you use them effectively.
We’ll cover a lot of ground, including:
- What kinds of A/A test are there?
- Why do people do them?
- Why is this a problem?
- The dirty secret of split testing
- Triangulate your data
- Watch the test like a chef
- Machine learning & summary
What kinds of A/A tests are there?
A/A – It’s a 50/50 split
The most common setup here is just a 50/50 split in testing exactly the same thing. Yup. We run a test of the original page against itself. Why?
The idea here is to validate the test setup by seeing that you get roughly the same performance from each variant. You’re testing the same thing against itself, to see if there’s noise in the data, instead of signal.
In a coin flipping example, you’re testing that if you flip the coin a number of times, it will come out equally in terms of heads and tails. If the coin was weighted (like a magician’s special coin) then running the exercise would let you know there was some noticeable bias.
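The coin-flip framing is easy to simulate. A minimal sketch (assuming a hypothetical 5% conversion rate – the numbers are illustrative, not from any real test) shows that two identical “pages” still report slightly different rates through noise alone:

```python
import random

random.seed(42)

def aa_test(visitors=10_000, conversion_rate=0.05):
    """Simulate an A/A test: both buckets serve the identical page."""
    a = sum(random.random() < conversion_rate for _ in range(visitors))
    b = sum(random.random() < conversion_rate for _ in range(visitors))
    return a / visitors, b / visitors

rate_a, rate_b = aa_test()
# The two rates differ by random noise alone, even though the 'coin' is fair.
print(f"A: {rate_a:.3%}  B: {rate_b:.3%}")
```

Run it a few times with different seeds and you will see the gap between the two identical buckets wander around – that wander is the noise an A/A test is supposed to reveal.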
Running an A/A is about validating the test setup. People basically use this to test the site to see if the numbers line up.
The problem is that this takes time that would normally be used to run a split test. If you have a high-traffic site, you might think this is a cool thing to do – but in my opinion, you’re just using up valuable test insight time.
In my experience, it’s a lot quicker just to properly test your experiments before going live. It also gives you confidence in your test where A/A frippery may inject doubt.
What do I recommend then?
- Cross-browser testing;
- Device testing;
- Friends & family;
- Analytics integration;
- Watching the test closely.
This approach is a lot quicker and has always worked best for me, rather than running A/A tests. Use triangulated data, obsessive monitoring and solid testing to pick up on instrumentation, flow or compatibility problems that will bias your results, instead of using A/A test cycles.
The big problem that people never seem to recognise is that flow, presentation, device or browser bugs are the most common form of bias in A/B testing.
A/A/B/B – 25% Splits
OK, what’s this one then? It looks just like an A/B test, except it isn’t. We’ve now split the traffic into four 25% samples, which contain both A and B in duplicated segments.
So what’s this supposed to solve? It’s there to check the instrumentation again (like A/A) but also to flag oddities in the outcomes. I get the A/A validation part (which I’ve covered already), but what about the results looking different between your two A samples, or between your two B samples?
But what if they don’t line up perfectly? Who cares – you’re looking at the wrong thing anyway: the average.
Let’s imagine you have 20 people come to the site, and five of them end up in each sample bucket. What if five repeat visitors all land in one sample bucket? Won’t that skew the results? Hell yes. But that’s why you should never look at small sample sizes for insight.
So what have people found using this? That sample performance does indeed move around, especially early in the test or when you have small numbers of conversions. I tend not to trust anything until I’ve hit 350 outcomes in a sample and at least two business cycles (e.g. two weeks for a weekly cycle), among other factors.
The problem with this method is that you’ve split A and B into four buckets, so the effect of skew is more pronounced: your effective sample size is smaller, and therefore the error rate (the fuzziness, the +/- on each measurement) of each individual sample is higher. Put simply, the chances that you’ll see skew are higher than if you’re just measuring one A and one B bucket.
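The sample-size penalty is easy to quantify: the standard error of a conversion-rate estimate scales with 1/√n, so halving each bucket (A/B → A/A/B/B) inflates the fuzziness of every reading by about 41%. A quick sketch, assuming an illustrative 5% conversion rate and 20,000 total visitors:

```python
import math

def std_error(p, n):
    """Standard error of a conversion-rate estimate from n visitors."""
    return math.sqrt(p * (1 - p) / n)

p, total = 0.05, 20_000            # assumed 5% conversion, 20k visitors

se_ab   = std_error(p, total // 2)  # A/B: two buckets of 10,000
se_aabb = std_error(p, total // 4)  # A/A/B/B: four buckets of 5,000

print(f"A/B bucket SE:     {se_ab:.4f}")
print(f"A/A/B/B bucket SE: {se_aabb:.4f}")  # sqrt(2) ~= 1.41x wider
```

The same traffic buys you noticeably less precision per bucket, which is exactly why the duplicated samples appear to “disagree” with each other.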
If you tried A/A/A/B/B/B, you’d just magnify the effect. The problem is knowing when the samples have stopped moving around – this is a numbers thing, but it’s also done a lot by feel for the movements in the test samples. The big prize is not how test results fluctuate between identical samples (A/A/B/B) – it’s how visitor segments fluctuate (covered below).
A/B/A – A better way
Suggested by @danbarker this works to help identify instrumentation issues (like A/A) but without eating into as much test time.
This has the same problem as A/A/B/B, in that the two A samples are smaller and therefore have higher error rates. Your reporting interface is also going to be more complex, as you now have three lines of numbers to crunch (or four, in A/A/B/B).
You also have the issue that, as the samples are smaller, it will take longer for the two A variants to settle than in a straight A/B test. Again, a trade-off of time versus validation – but not one I’d like to take.
If you really want to do this kind of test validation, I think Dan’s suggestion is the best one. I still think there is a bigger prize though—and that’s segmentation.
Why do people do A/A tests?
Sometimes it’s because it is seen as ‘statistics good practice’ or a ‘hallmark’ of doing testing properly.
It’s also seen as a clean way of running the test to have a dry run before the main event. For me, the cost of fixing the car whilst I’m driving it (running live tests) is far higher than when stationary in the garage (QA).
For me, getting problems out of ANY testing is the priority and A/A doesn’t catch test defects like QA work does. It might be worth running one if you’re bedding in some complex code that the developers can re-use on later tests. I just can’t recommend doing A/A for every test.
What’s the problem then?
The problem is always eating real traffic and test time, by having to preload the test runtime with a period of A/A testing. If I’m trying to run 40 tests a month, this will cripple my ability to get stuff live. I’d rather have a half day of QA testing on the experiment than run 2-4 weeks of A/A testing to check it lines up.
The other problem is that nearly 80% of A/A tests will reach significance at some point. In other words, the test system will conclude that the original is better than the original with a high degree of confidence!
Why? Well, it’s a numbers and sampling thing but it’s also because you’re reading the test wrong. If you have small samples, it’s quite possible that you’ll conclude that something is broken when it’s not.
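The peeking problem is easy to demonstrate: if you repeatedly check an A/A test and stop the moment a two-proportion z-test crosses the 95% threshold, the false-positive rate climbs far above the nominal 5%. A rough simulation (assumed 5% conversion rate and a hypothetical peek-every-250-visitors schedule – the exact figures will vary with the setup):

```python
import math
import random

random.seed(0)

def ever_significant(n_max=10_000, p=0.05, check_every=250, z_crit=1.96):
    """Run one A/A test, peeking at a z-test every `check_every` visitors.
    Returns True if it *ever* crosses the 95% significance threshold."""
    ca = cb = 0
    for n in range(1, n_max + 1):
        ca += random.random() < p
        cb += random.random() < p
        if n % check_every == 0:
            pooled = (ca + cb) / (2 * n)
            if pooled in (0.0, 1.0):
                continue  # no conversions yet, z-test undefined
            se = math.sqrt(2 * pooled * (1 - pooled) / n)
            if abs(ca / n - cb / n) / se > z_crit:
                return True
    return False

runs = 200
hits = sum(ever_significant() for _ in range(runs))
print(f"{hits / runs:.0%} of these A/A tests 'won' at some peek")
```

With only 40 peeks per test, well over 5% of identical-vs-identical tests declare a “winner” at some point; keep peeking indefinitely and the figure keeps climbing, which is what the oft-quoted high A/A false-positive rates describe.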
The other problem is—when you’re A/A testing—you’re comparing the performance of two things that are identical. The amount of sample and data you need to prove that there is no significant bias is huge by comparison with an A/B test.
How many people would you need in a blind taste testing of Coca-Cola (against Coca-Cola) to conclude that people liked both equally? 500 people, 5000 people?
And this is why we don’t test very similar things in split tests – detecting marginal gains is very hard, and when you test identical things, this is even more pronounced. You could run an A/A test for several weeks LONGER than the A/B test itself and get no valuable insight, either about whether the test was broken or about your ability to understand sampling <grin>
A good example here is that people who espouse A/A testing forget two other biases in running tests: the slower converter and the novelty effect.
If you run a test for two weeks and your average purchase cycle is four weeks, you’re going to cut off some visitors to the experiment when you close the test. This is why it’s important to know your purchase cycle time to conversion as you might run an A/B test that only captures ‘fast converters’.
Ton Wesseling always recommends ( and I agree) that you leave an experiment running when you ‘close’ it to new visitors. That way, people who’re part way through converting can continue to see the experiment and convert after the end of the test. This is a way of letting the test participants flush through the system and add more sample, without showing it to new people.
If you’re optimising the end of the testing cycle, by understanding purchase cycles, isn’t there some sort of bias at the start of testing?
Well part of this is ‘Regression toward the mean’ which we see in tests for all sorts of things, and the second part is the novelty effect.
If James has been visiting the website for four weeks and is about to purchase, he’s been seeing the old product page for all that time. On his final visit before converting, he sees a brand new shiny product page that’s much better and is influenced to buy. Your friend Bob, meanwhile, has been seeing the same page for four weeks and when he arrives, he still gets the old (control) version.
This means that the stream of new people entering the experiment also contains ‘old’ visitors who are later in their lifecycle. This novelty spike can bias the data early in your test, at least until some of these cycles have flushed through the experiment. In theory, you ought to start the test a few weeks early and cookie all visitors, so you only put genuinely new visitors into your experiment – not those who might be hit by a novelty effect late in their purchase cycle, for example.
My point of showing these two examples, is that there are loads of sources of bias in our split testing. A/A testing might spot some big biases but I find that it’s inefficient and doesn’t answer everything that QA, analytics integration and segmentation can.
The dirty secret of testing
Every business I’ve tested with has a different pattern, randomness or cycle to it—and that’s part of the fun. Watching and learning from the site and test data during live operation is one of the best parts for me. But there is a dirty secret in testing – that 15% lift you got in January? You might not have it any more!
Why? Well, you might have cut your PPC budget since then, sending fewer warm leads into your business. You might have run some TV ads that really put off people who previously responded well to your creative.
It might be performing much better than you thought. But you don’t actually know!
It’s the Schrödinger’s Cat of split testing—you don’t know unless you retest it, whether it’s still driving the same lift. This is the problem with sequential rather than dynamic testing—you have to keep moving the needle up and you don’t know if a test lift from an earlier experiment is still delivering.
You leave a stub running
To get around the fact that creative performance moves, I typically leave a stub running (say 5-10%) to keep tracking the old control (loser) against the new variant (winner) for a few weeks after the test finishes.
If the CFO shows me figures disputing the lift, I can show how much higher the winner is than the old creative, which the stub is still tracking. This has been very useful, at least when bedding a new tool in with someone who distrusts the lift until they ‘see it coming through’ the other end!
However, if you’re just continually testing and improving—all this worry about the creative changes becomes rather academic—because you’re continually making incremental improvements or big leaps.
The problem is where people test something and then STOP – this is why there are some pages I worked on that are still under test 4 years later – there is still continual improvement to be wrought even after all that time.
Products like Google Content Experiments (built into Google Analytics) and Conductrics now offer multi-armed bandit algorithms to get round this obvious difference between what the creative did back then vs. now (by adjusting what is shown to visitors as their behavioural response changes).
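Neither tool’s internals are public in detail, but the core bandit idea is simple: mostly serve the variant that currently looks best, while reserving a slice of traffic for exploration so the allocation can adapt if behaviour shifts. A minimal epsilon-greedy sketch, with hypothetical conversion rates:

```python
import random

random.seed(1)

def epsilon_greedy(true_rates, rounds=100_000, epsilon=0.1):
    """Epsilon-greedy bandit: exploit the best-looking arm most of the
    time, explore a random arm 10% of the time."""
    shows = [0] * len(true_rates)
    wins = [0] * len(true_rates)
    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.randrange(len(true_rates))   # explore
        else:
            # exploit: arm with best observed rate (optimistic for unseen arms)
            arm = max(range(len(true_rates)),
                      key=lambda i: wins[i] / shows[i] if shows[i] else 1.0)
        shows[arm] += 1
        wins[arm] += random.random() < true_rates[arm]
    return shows

# hypothetical variants: A converts at 5%, B at 6%
traffic = epsilon_greedy([0.05, 0.06])
print(traffic)   # traffic drifts toward the better-performing arm
```

Production systems use more sophisticated allocation (e.g. Thompson sampling), but the principle is the same: the split is no longer fixed at 50/50, it follows the observed response.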
I postulated back in 2006 that this was the kind of tool we needed – something that dynamically mines the web data, visit history, backend data and tracking into a personalised and dynamic split test serving system. Something that can look at all the customer attributes, targeting, advertising, recommendations, personalisation or split tests—and know what to show someone, at what time. Allowing this system to self-tune (with my orchestration and fresh inputs) looks like the future of testing to me:
[Reference article: Multi Armed bandits]
Triangulate your data
One thing that’s really helped me to avoid instrumentation and test-running issues is to run at least two analytics sources. Make full use of your split-testing software’s ability to integrate with a second analytics package, as a minimum.
Doing so gives you two sources of performance data to triangulate or cross-check against each other. If they don’t line up proportionally, or look biased to an analyst’s eye, this can catch reporting issues before you’ve started your test. I’ve encountered plenty of issues with A/B testing packages not lining up with what the site analytics said – and it’s always been a developer and instrumentation issue. You simply can’t trust one set of experiment metrics; you need a backup to compare against, in case you’ve broken something.
Don’t weep later about lost data – do your best up front to make sure it doesn’t happen. The second source also acts as a belt-and-braces monitoring system once you start testing, so you can keep watching and checking the data.
[Reference article: How to Analyze Your A/B Test Results with Google Analytics]
Watch it like a chef
You need to approach every test like a labour intensive meal, prepared by a Chef. You need to be constantly looking, tasting, checking, stirring and rechecking things as it starts, cooks and gets ready to finish. This is a big insight that I got from watching lots of tests intensely—you get a better feel for what’s happening and what might be going wrong.
Sometimes I will look at a test hundreds of times a week – for no reason other than to get a feel for fluctuations, patterns or solidification of results. You have to resist the temptation to be drawn in by the pretty graphs during the early cycle of a test.
If you’re less than one business cycle (e.g. a week) into your test, ignore the results. If you have fewer than 350 outcomes (and certainly if fewer than 250) in each sample, ignore the results. If the samples are still moving around a lot, ignore the results. It’s not cooked yet.
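Those rules of thumb are easy to encode as a gate in your reporting. A sketch using the thresholds above (note they are this article’s heuristics, not universal constants – tune them to your own business cycle and volumes):

```python
def safe_to_read(days_running, outcomes_per_sample,
                 business_cycle_days=7, min_outcomes=350):
    """Gate test readouts on the rules of thumb above: at least one full
    business cycle, and ~350 outcomes in every sample."""
    if days_running < business_cycle_days:
        return False, "less than one business cycle - ignore the results"
    if min(outcomes_per_sample) < min_outcomes:
        return False, "too few outcomes in a sample - ignore the results"
    return True, "worth reading, but keep watching for movement"

ok, why = safe_to_read(days_running=5, outcomes_per_sample=[400, 390])
print(ok, why)   # not a full business cycle yet, so don't read it
```

Wiring a check like this in front of your dashboard is a cheap way to stop yourself (and stakeholders) being seduced by the pretty graphs too early.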
Anyone with solid test experience knows that your data and response is moving around constantly—all the random visitors coming into the site and seeing creatives, is constantly changing the precision and nature of the data you see.
The big problem with site averages for testing is that you’re not looking inside the average—to the segments. A poorly performing experiment might have absolutely rocked—but just for returning visitors. Not looking at segments will mean you miss that insight.
Having a way to cross-instrument your analytics tool (with a custom variable, say, in GA) will allow you to segment performance at the creative level. One big warning here: if you split the sample up, you’ll get small segments.
If you have an A and a B creative, imagine them as two large yellow space hoppers floating above a tennis court. You are in the audience seating, trying to measure how far apart they are. They aren’t solid objects but fuzzy ones – you can’t see precisely where the centre of each is, just an indistinct area in space.
Now, as your test runs, these space hoppers shrink, so you can be more confident about their location and, for example, their difference in height. By the time each is the size of a tennis ball, you’re much more confident about their precise locations and can measure far more accurately how far apart they are.
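The shrinking space hopper is just a confidence interval narrowing as 1/√n. A quick sketch (again assuming an illustrative 5% conversion rate) shows how the ± fuzziness tightens as the sample grows:

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Half-width of a ~95% confidence interval around a conversion rate."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.05   # assumed 5% conversion rate
for n in (500, 2_000, 10_000, 50_000):
    print(f"n={n:>6}: {p:.1%} +/- {ci_halfwidth(p, n):.2%}")
```

Quadrupling the sample only halves the interval, which is why the last stretch of “settling” always feels so slow, and why carving the sample into segments (smaller n) blows the fuzziness straight back up.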
Be wary of small sample sizes
If you split your A and B results into segments, you hugely increase how fuzzy your data is. So be careful not to segment into tiny samples, and be careful about trusting what the data tells you at small numbers of conversions or outcomes.
Other than that, segmentation will tell you far more useful things about an A/B test than any sample-splitting activity, because it works at the level of known attributes of visitors, not just the fuzziness of numbers. When a test fails, that alone should give me some insight, but I always mine the segments to see what drove the average. That gives me key insight not only into my hypothesis, but into how different groups reacted to my experiment.
And this is the most important bit: when you get a test that comes out ‘about the same’ as the original, there is almost guaranteed to be a segment-level driver that will be of interest to you. The average may look flat, but the segment-level response likely contains useful insight.
Is the future of split testing in part automation? Yes—I think so. I think these tools will help me run more personalised and segment driven tests – rather than trying to raise the ‘average’ visitor performance. I also think they remove the need to have tools for personalisation, targeting, split and multi-variate testing – basically all experiments with ‘trying stuff on people’.
The tools will simply help the area of experimentation I can cover, a bit like going from using a plough to having a tractor. I don’t think it reduces the need for human orchestration of tests – just helps us do much more at scale than we could ever imagine doing manually.
What’s the best way to avoid problems? Watching the test like a hawk and opening up your segmentation will work wonders. QA time is also a wise investment: it beats, hands down, the existential angst of having to scrap useless data four weeks later.
And thanks to many people for having the questions and insights that made me think about this stuff. A hat tip to @danbarker, @peeplaja, @mgershoff, @distilled and @timlb for refining my knowledge and prompting me to write this article.
Join the conversation
Interesting article, but I see a lot of possible misunderstanding about the nature of online data collection, as well as the real reason for variance testing. Note that I could not agree more about avoiding all of the methods you mention above, simply because they provide no useful information, but you do miss actual variance testing and why it is used. It is not a measure of a test; it is a measure of the nature of online data collection. It is not about QAing the tool or the test, it is about calibrating your understanding of the data so that you can avoid irrational behavior and storytelling.
To start with, you are correct about sampling bias and regression to the mean, but you miss that there are a thousand small errors that happen in online data collection. Like all samples, you can calculate an approximate population error rate from the sample, which is what most confidence calculations do, but they don’t account for those small hiccups that happen all the time. I have done variance studies on millions of users for some sites, and they can show a 2-3% difference between identical experiences after 3 weeks, at 99% confidence, for the entire time. While you can argue that there are some distribution errors causing havoc with the population error rate calculation, you cannot assume that it is all due to that. To make it worse, depending on what the basis of your calculation is (user, visit, impression, etc.), you can get entirely wild variance as well as misrepresented outcomes.
The real key to understand is that you are trying to maximize your ability to make rational decisions, and understanding your data, and when you can act on it, is paramount to that. If you end up with a 30% lift, great, it doesn’t matter. But what about 10%? 5%? 3%? 2%? Each of those may or may not be actual impact, and you need to know that. Even worse, the compound effect of a Type I error (false positive) is such that people will try to incorporate that “learning” in other places and will not challenge conventional wisdom, severely limiting the outcomes of their program. In other words, people will think they are right when they aren’t, and will continue on with what they wanted to do instead of changing their view of what works and what doesn’t.
This problem only gets worse when you are trying to do “40 tests a month”. The number of tests you run a month is a cover for not being efficient with resources or time, and blaming variance testing is the same as a bad driver blaming FORD. This screams that you are not testing to maximize efficiency but action. Mathematically you are always going to be much better off doing fewer tests with far more experiences than you are doing a bunch of just A/B tests. Not even counting confirmation bias and all of those things, the math simply says that you are going to get far more consistent and definitely far more dramatic increases when you compare a multiple of options (fragility). Variance might not matter when my top choice is 40% better than control, but what about when I have one experience that is 42% better and one that is 39% better? That is the most common type of scenario when you are testing for efficiency and not just for action. While there is always a higher chance of Type II errors (false negatives) as well as Type I errors when it comes to too many comparisons, the key is to ensure discipline in action.
Another key thing to understand is that both your slower converter and novelty effect are mitigated and misleading if you are testing correctly. When you start a test you are taking people at all points in the cycle. Not everyone is on day 1 of their 40 day purchase cycle, you are taking a true random sample of everyone at that point. If you only influence people early, you will see that in the pattern of data. If you don’t, you will also see that. The same is true for novelty effect. Some people have been to your site before, some haven’t, but what is key is that you represent all of them in the actual outcome. Unless you are telling me that people who have been to your site have no impact on your business going forward, then overly steering the test towards people who are only starting creates a selection bias and can dramatically impact the value of any results you see. The real key is to pay attention to a cumulative graph of all your data over time to see if you see any major inflection points (strong changes in acceleration relative to each other) so that you can see short tail and long tail behavior patterns. Doing this will show things when you don’t expect them as well as show when things don’t change when you think they will. Never assume a behavior and never assume that the data will match either correlative analytic or qualitative feedback.
Now then, a real variance study requires that you run the test exactly as you would any other test, which means that if you normally run 6 experiences, do 6 experiences. If you normally run 10 experiences, then run it at 10 experiences. If you normally run only 2 or 3 experiences, stop what you are doing and re-evaluate your testing, because I assure you that you are wasting resources and time. If you do a 6-experience variance test, then you will get 30 points of comparable data. Graph it, look at it over time, look at the maximum and the average. You will find your normalization period from doing this. Some sites are 3 days, some 10 days, most are 5-7 days. Expect that your tests have to run at least 7 days past that normalization period. You don’t have to do this before each test, but doing it 1-2 times each year, and as the very first thing out the door, will help mitigate all sorts of bad decisions in the future. The problem with an A/A or a double blind (A/A/B/B) is that you only get 1 or 2 data points, and variance, like everything else, has a range and distribution, as well as a normalization curve. Just doing an A/A test like that is pointless and misleading.
Thanks for the long response! There simply wasn’t enough room in the article to cover some of the things you mention, so let me break down my responses and cover each area you raised. Your comments are prefixed with >>
>> you miss that there are a thousand small errors that happen in terms of online data collection.
I didn’t talk about all of these but they are always there. Ronny Kohavi has covered many of them in some detail (I can share articles if you want), but one of the key problems is this: unlike, say, a drug trial, where I can pre-screen and control the population coming into a test, I have no such control over a randomised A/B test. I can’t tell whether the sample that preferred A did so purely because they were teenage girls, unless I know there was that segment-level response. In many cases, the A and B samples are, quite simply, just not the same – even with large amounts of traffic – because the people differ in ways that our website segmentation doesn’t map onto (or doesn’t account for).
Another problem is that you may have bias that isn’t accounted for by running an A/A test – because the in-page code fails on the B version when you publish it. It might be that Chrome doesn’t render the variant correctly; I often find this (browser and device compatibility) is a bigger source of problems than anything an A/A test would typically find in terms of calibration. We both know that biases exist – it’s just that I know there are some practical issues that trump the theoretical problems.
>> The number of tests you run a month is a cover for not being efficient with resources or time, and blaming variance testing is the same as a bad driver blaming FORD.
I’m not trying to trade volume for variance testing. I’m just saying that I’d rather have rock solid code, good analytics instrumentation, QA testing and careful monitoring than A/A testing adding to the elapsed time. If I’m going to run good quality tests with solid hypotheses, I would rather fail fast (with some errors) than run perfect tests. The problem with most clients today is they’re using 95% confidence as their stopping rule, which significantly ups the error rate.
I’m not blaming A/A tests for inefficiency – they have their uses. I’m just saying they eat test time, and that’s not always welcome, particularly if you do it for almost every test (yes, I know people who do this). And in many cases, people obsess so much about the A/A test results not lining up that no effing A/B testing actually happens (laughs).
>>Another key thing to understand is that both your slower converter and novelty effect are mitigated and misleading if you are testing correctly
True for the former, and this is why I don’t do anything for the novelty effect in my tests. I guess I didn’t make the point strongly enough though – if you’re worrying about A/A testing then you also need to worry about how the sample works with purchase cycles. And that means thinking about how those different cohorts respond to one or more exposures to the test. In practice, most companies are not even testing for decent sample sizes, never mind avoiding using the confidence level as a stopping mechanism (amongst many other self-stopping problems). I’ve only just recently started to get people to think carefully about their cycles (short and longer) in terms of how they run their A/B testing, so I’m still not sure how A/A testing helps them here. Will it help them to stop testing for 2 weeks when their purchase cycle is 12 weeks? Probably not. And in such a short test, that novelty effect is still a factor, particularly for companies that don’t test for long enough!
The point with the novelty effect is that it’s more pronounced when you start a test. Basically, a heap of returning visitors now see a different creative and this has a different reaction than visitors arriving and only seeing one treatment over their lifecycle. It does have an impact and takes time to flush through (as you correctly explain, it’s about capturing as much as possible of real behaviour across many stages of the cycle) – it’s just a theoretical point rather than something I do anything practical about.
As for people finishing a test, I can’t see why it’s not OK to leave people to finish the test, even if the test has closed to new visitor intake. You don’t have to do this but I think it’s cleaner rather than cutting off the test to everyone, including those already part of an experiment.
>> If you do a 6 experience variance test then you will get 30 points of comparable data. Graph it, look at it over time, look at the maximum and the average. You will find your normalization period from doing this.
This is what would be useful for you to explain – what’s the practical use for my clients of running this? Most will test for significantly longer than your ‘normalisation period + n days’ rule so what’s it telling them here? How can they use this data?
I’m not sure what you mean by ‘if you normally run 6 experiences, do 6 experiences’ – if you could explain those two, that would be helpful.
I’m struggling to see the value explained in a way that my clients (most importantly) can grasp and judge to be an effective use of test time. In practice, the biggest drivers of wasted test time for my work come from lousy session handling, randomisation, failure to understand business & purchase cycles, not knowing how long to run a test for or when to stop. Add in poor implementation of code in the browser or device handling and marketing activity changing over test cycles and you find that these ARE the *common* problems that *should* be avoided.
If you can explain the value to me in a way that’s worth sharing with others, I’ll heartily support this. I just haven’t hit a human digestible and compelling case yet but I’m happy to listen!
The biggest problem for those clients doing A/A testing is that they get results that differ, and then worry about the results in an A/B test varying. Quite right too – marginal gains in conversion rate (or any other metric) are hard to spot. I’m not sure A/A answers this one either.
I’d also make the point that the article title should have been “A/A testing is a waste of time, as practised by most companies out there” who simply use it to ‘self convince’ that their test isn’t biased – that their data collection systems are working correctly and that the response is pretty similar.
If used in this way, it really doesn’t help much, especially if the other potential sources of problems are missed (session, browser, device, cycles etc.)
Thank you for replying to my rather lengthy response. There were a number of questions in there, some I will answer here, some may come in the form of an upcoming post. Before I get too deep however I want to again stress that what I am talking about is 100% focused on sites big enough to run a full program and not just do a test from time to time.
>>This is what would be useful for you to explain – what’s the practical use for my clients of running this? Most will test for significantly longer than your ‘normalisation period + n days’ rule so what’s it telling them here? How can they use this data?
>>I’m not sure what you mean by ‘if you normally run 6 experiences, do 6 experiences’ – if you could explain those two, that would be helpful.
You are trying to calibrate how you should act on data. If variance is too high, then no manner of confidence calculation should ever be looked at (honestly, most groups should never look at confidence unless they clearly understand when it can be useful). What you are trying to establish is the normalization period for the causal data (VERY different from analytics data), as well as the normal range and what is actionable for the data. Understanding variance and rules of action is one of the clearest signs that someone understands the difference between analytics and optimization and the various disciplines needed.
If you do this on Day 1, while you are getting ready for everything else, it doesn’t even have an opportunity cost. What it will do is allow you to maximize outcomes by telling you the maximum number of experiences you can reasonably run, what the normalization period is, what the normal timeframe for a test is, and what normal rules of action are for your site and experience. If you aren’t sure, then run it at 6 experiences (in other parlance, an A/A/A/A/A/A test) so that you get 30 data points (all 6 compared to the 5 others). It is one of the most valuable things you can do, because it gives you so much information about how to maximize the value of all future efforts.
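A minimal sketch of such a variance study, assuming conversion is a simple coin-flip process (the 3% rate and traffic figures are illustrative, not from any real site):

```python
import itertools
import random

random.seed(42)

# Illustrative numbers: a 3% true conversion rate, identical across all
# six "experiences" (an A/A/A/A/A/A split), 10,000 visitors per arm.
TRUE_CR = 0.03
VISITORS_PER_ARM = 10_000
ARMS = 6

# Simulate conversions for each identical arm.
rates = []
for _ in range(ARMS):
    conversions = sum(random.random() < TRUE_CR for _ in range(VISITORS_PER_ARM))
    rates.append(conversions / VISITORS_PER_ARM)

# Every pairwise gap between arms that are, by construction, the same page
# (15 unordered pairs, i.e. the 30 ordered comparisons mentioned above).
gaps = [abs(a - b) for a, b in itertools.combinations(rates, 2)]

print("observed rates:", [f"{r:.4f}" for r in rates])
print(f"largest gap between identical arms: {max(gaps):.4%}")
```

The spread of `gaps` is the range of pure noise between identical pages – the “normal range” that future A/B differences have to clear before they mean anything.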
>>As for people finishing a test, I can’t see why it’s not OK to leave people to finish the test, even if the test has closed to new visitor intake.
Closing versus ending a test… Different camps on this, really just flavor. If you are going to run other traffic into a test then it limits what traffic you can use reasonably to avoid cross pollination or makes you delay other tests. I would also argue that it adds a bit of bias against people who were already in the process when they first interacted with the test. It is not the end of the world but there are many arguments why that is a bad idea.
>>In practice, the biggest drivers of wasted test time for my work come from lousy session handling, randomisation, failure to understand business & purchase cycles, not knowing how long to run a test for or when to stop. Add in poor implementation of code in the browser or device handling and marketing activity changing over test cycles and you find that these ARE the *common* problems that *should* be avoided.
I agree that those are common time wasters, but I look at them as major red flags and as symptoms of real problems. The real things that are valuable are usually the things that clients never think about and aren’t really aware of. This is why education is so important: it will cut off many of these things. It’s the classic 80/20 rule, except it is more like this: 100% of what people want to focus on is pointless, and taking 20% of your time to educate people on the real disciplines of optimization solves so many of those problems, and future ones too.
Too much time looking at cross-system variance or at confidence? Then you need to educate better on the nature of data, causal vs. correlative data, rules of action and the need for rational data. Too much time in QA or in resources to get a test live? You need to educate on discovery vs. validation, exploration, efficiency and data discipline. You can also use it as a time to establish a good framework and expected workflow for any test, so that it is not just random ad hoc resources. Failure to understand business cycles? Maybe it matters, maybe it doesn’t; looking at short-tail and long-tail behaviors over a few tests and doing a variance study makes that point moot, since those are what establish clear rules of action. I cannot stress enough how often consultants miss the need for a real education program and think they are being productive, when all they are doing is wasting time and cycles dealing with the symptoms instead of the disease. It is also a great sign of where a consultant needs to improve their knowledge if they don’t know how to handle any of those topics.
>> “A/A testing is a waste of time, as practised by most companies out there” who simply use it to ‘self convince’ that their test isn’t biased
Could not agree more, but if used proactively and with the proper discipline and as part of an education program, they are one of the most valuable actions that a group can take, especially when it is just starting out.
A/A/B/B is A/B
Put another way.
A1/A2/B1/B2 = A3/B3
A1+A2 = A3
B1+B2 = B3
You just put a line through the middle of A3 and made it A1 & A2, there’s no reason you can’t remove the line and bring the two back together.
In theory the right tool would do this automatically but you could do it manually if you wanted. Just join the metrics from A1 & A2 and do the calculation with B1 & B2.
This way you can validate your testing setup (to a limited degree admittedly) without wasting actual testing time.
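A sketch of the join described above, with made-up visitor and conversion counts:

```python
# Illustrative counts for an A/A/B/B layout: two 25% control cells and two
# 25% variant cells. Summing each pair recovers the plain A/B test.
a1 = {"visitors": 5_000, "conversions": 160}
a2 = {"visitors": 5_000, "conversions": 148}
b1 = {"visitors": 5_000, "conversions": 175}
b2 = {"visitors": 5_000, "conversions": 181}

def pool(*cells):
    """Remove the line down the middle: A1 + A2 = A3."""
    return {
        "visitors": sum(c["visitors"] for c in cells),
        "conversions": sum(c["conversions"] for c in cells),
    }

a3 = pool(a1, a2)  # the reunited control
b3 = pool(b1, b2)  # the reunited variant

for name, arm in (("A3", a3), ("B3", b3)):
    cr = arm["conversions"] / arm["visitors"]
    print(f"{name}: {arm['conversions']}/{arm['visitors']} = {cr:.2%}")
```

From here you run the ordinary significance calculation on A3 vs B3, exactly as if it had been a straight A/B all along.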
@Dave W – exactly – that’s what my A/A/B is – a split of a1 and a2 at 25%.
@Andrew – yours is in the >> quotes below:
>> If variance is too high, then no manner confidence calculation should ever
>> be looked at (honestly most groups should never look at confidence
>> unless they clearly understand when it can be useful).
Actually that’s one of the big problems – people place too much dependence on confidence calculations when declaring a test. The p-value (as used by most A/B testing solutions) just moves around too much, so it isn’t reliable, especially early in a test with lower samples collected.
But the nub of the problem with your quote “if variance is too high” is how long you’d take to get to that position confidently. If you are running an A1..A6 test, you’d have to run it for at least as long as an A/B would likely take, and perhaps much longer (it all depends on visitors, conversions, cycles etc.). So if the time to run a test is n weeks, then the time to run an A/A is going to be n weeks or more.
In essence, it takes a long time to be happy about the variance in the test being an accurate measure of ‘variance’ as you have to collect a lot of sample, given that both creatives are the same.
The problem here is that any measured variance might be different the next time you measure (and the next). If a site has a volatile mix of marketing activity and customer traffic, you’d need to be A/A testing repeatedly and on lots of experiments to call your understanding of variance ‘current’.
To summarise though, I still can’t see the value. I know there is going to be variance in the data – that’s why we always push for non-subtle changes, to get a more marked behaviour change. Many A/B testing tools suffer from presenting tests as accurate in this scenario – when the gains are very marginal. If you’re nailing good 25% lifts in testing, that’s easier to spot, with non-overlapping error bars. If you’re declaring test after test at 1, 2 or 3%, you could merely be inflating false positives or negatives.
Optimizely have updated their stats engine to help here, although I think there’s a long way to go in terms of both tool vendors and education around experimentation.
Back to the value thing then. If my understanding of your explanation is correct, you’re saying that running this in spare time at the start of the test tells you how much variation to expect from traffic, that might distort (especially marginal gains) AB test results.
If that’s not what you’re saying then I’ll have to ask you to explain again (and if I can’t get it, my clients won’t get it either, or yours).
If that *is* what you’re saying, then I’d advance that this knowledge is ephemeral – the change in your traffic mix, activity, market, competition could change it in minutes, hours or weeks. The knowledge might be useful to me for testing out some code, to see there isn’t a huge mismatch in the data collection side – but it won’t be useful for anything else.
So – assume I’m convincing my clients to run an A/A test. If I tell them “This will take as long as an A/B test, I can’t explain why you need this but it’s really useful, and by the way, the information might change so we’ll need to run it again later, repeatedly”, then it won’t work.
I admire the level of sophistication and passion you put into your testing ideas and writing (I read some of your rants) but I come from a practical, rather than theoretical, school of experimentation. In this world, there is a hierarchy of problems and most of them are just about not doing things the right way. It’s human problems of stupidity, greed, impatience and bad stats – something I find everywhere, not just in A/B testing.
We can certainly agree on that but I can’t agree on the priority of A/A testing over everything else that I have to continually avoid. I can’t see why or how I could sell doing this to clients, if I can’t explain it in terms they’d understand. And I can’t see why I’d waste the time running A/A rather than testing with discipline.
There is education required but if I get people to run A/A tests but they don’t run whole weeks, use tiny samples and close tests at 95% confidence then I’ve failed in my educational mission. It’s hard enough to get people to test in the right areas or have a solid hypothesis!
My analogy with the A/B testing stuff is that the vendors have massively democratised testing. Yay! But not. They just massively scaled doing stupid things with sampling. We can wish the universe would all go on a stats course, but that’s impractical. It’s a bit like we’ve handed out lightsabers to everyone and then get angst-ridden when people arrive at the hospital with limbs lopped off. It’s partly the fault of the vendors and partly the people using the products. Amen.
Back to the A/A testing though. It feels like we’re arguing here about the power output circuit of the light saber or the custom grip handle, rather than all the accidents happening. If doing this is vital, it should be explicable (and to marketers, not just practitioners like me).
To clarify again, I despise A/A testing. I like variance studies, which are full studies that look at expected range and maximum values over time, but those require a lot of data and a lot of discipline. 99% of what is done are A/A tests, which we both agree are a complete waste of time (I might be more against them than you).
>>If my understanding of your explanation is correct, you’re saying that running this in spare time at the start of the test tells you how much variation to expect from traffic
I am not saying before each test; I am saying on day 1, when you walk in the door and you don’t have another test ready, or creative, or anything. You run it while you are getting all of that other stuff ready. I preach the 30-minute rule for all test ideas, but you still need to do the groundwork before any org is ready to run with that. It by definition has to run longer than an average test (N+X in your parlance). You might do it again 6-12 months down the line in a lull, but it is never something done for each test. It really doesn’t change as much as people think, except in extreme circumstances (very limited product catalog and very narrow marketing focus). A site that averages 3% is going to be somewhere between 2.5-3.5% no matter when you test it.
I think we also missed a huge piece here, which is that even if the client doesn’t understand the immediate value, it opens the door for you, the consultant, to speak about what does and doesn’t matter. It allows you to make it clear that all vendors (and I worked for one for 5+ years) are trying to make people think they are successful when they really aren’t, and to make something very difficult seem easy, when in fact it is not.
I usually start with why statistical confidence does not mean colloquial confidence, why that is important, and how vital it is to act rationally and with a clear set of rules before a test launches. I then talk about data patterns, influences, and whether we have representative data or not. All of those things are vital to understand so that, when push comes to shove, you can test what is needed and act on what is needed in a rational and efficient manner. The variance study helps here as well, by showing how much things really do vary across experiences (not as much as most people think). Since 100% of the value of testing comes from how you think and act on it, and not from some random “here’s a good idea, let’s see if it works”, I make this the number one priority of all my work. People always want to know what analysis you do or what great best-practice test idea you want to try, when in fact those are as close to meaningless as can be, so it is up to you to not just appease them but instead do the hard thing and make them successful. Test ideas are fungible, getting results is not, so I prioritize and make clear, even when the client I am working with hates it, what we need to get settled.
Remember that the real skill of a consultant is not in knowing a tool or doing an action; it is in the ability to say no and get the right action, especially when the client doesn’t want to do it. Sometimes walking them through a sample experience with the correct number of options, a look at confidence and its many faults, a look at what the site values are and how they change, and how to look at causal data and not just correlative data is vital, if for no other reason than to open their eyes about just how off their thinking about optimization is.
I used to run A/A tests when clients were watching the campaigns and wanted us to take action if they saw 2 more conversions in one hour. It was to visualise the natural volatility of the conversion process. If, after 2 weeks of testing, an A/B difference like 105 vs 80 conversions occurred, they started to understand that 2 or 5 more conversions have no statistical significance. So I actually did it to save my and my clients’ time. It helped (a little).
Instead of A/A tests, you recommend using tools like http://abtestguide.com/calc – what worries me is that tools like this don’t ask for the standard deviation of the measured values. If the conversion process is more complex and has higher volatility (standard deviation), we surely need more data to obtain a significant result – especially if we measure non-unique conversions (number of transactions) or the value of the purchase. On the other hand, if the conversion is simple (click, subscribe etc.), the volatility is smaller and we should get significant signals faster.
So, these tools are convenient but I sometimes have doubts if we use the right model. Conversion is a complex process, influenced by many independent random factors, so the noise can be very high…
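That concern can be made concrete with the textbook per-arm sample-size approximation, n ≈ 2(z_α + z_β)²σ²/δ²: the required traffic scales with the variance of the metric. For a binary conversion the standard deviation is pinned down by the rate itself, which is why the calculators never ask for it; for revenue-type metrics it is not. A sketch with illustrative numbers:

```python
import math

def sample_size_per_arm(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """n = 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2 per arm,
    for ~95% confidence and ~80% power."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Simple binary conversion (a click): p = 5%. The standard deviation is
# fully determined by p, so calculators can derive it silently.
p = 0.05
sigma_click = math.sqrt(p * (1 - p))
n_click = sample_size_per_arm(sigma_click, delta=0.01)  # detect +1 point

# Revenue per visitor: mean ~3, standard deviation 25 (a few large orders
# dominate). Similar sensitivity is far more expensive in traffic.
n_revenue = sample_size_per_arm(sigma=25.0, delta=0.30)

print(f"click test:   ~{n_click:,} visitors per arm")
print(f"revenue test: ~{n_revenue:,} visitors per arm")
```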
“@Dave W – exactly – that’s what my A/A/B is – a split of a1 and a2 at 25%.”
I think you’re missing my point though..
based on what you say in the article in the A/A/B section:
“You also have the issue that as the samples are smaller, it will take longer for the two A variants to settle than a straight A/B. Again, a trade-off of time versus validation – but not one I’d like to take.”
It won’t take longer to settle, and there is no trade-off, because you’re doing both at the same time.
A/A/B/B or A-25%/A-25%/B-50% is also an A/B test when you add up the sample data to make A/B.
Therefore there is no loss.
If you exclude the admin of bringing the two tests together it’s a zero-sum calculation. It’s just an improvement, there’s no loss in any way…so, if you’re concerned about this and don’t mind the extra admin you should always do A/A/B or A/A/B/B.
Having this built into popular testing tools as an education/accountability tool would be very useful…
Also… I don’t get emails when people comment on this, which is a shame – it just happened that I came back across this article.
Not sure I agree with your point that running A/A tests precludes you from running other tests on the same page. What’s wrong with running your A/A test in parallel? Since no page elements are modified, there won’t be any impact on the other test you’re running.
But I do agree that testing the split test tool is pretty much a waste of time as there is not going to be any underlying bias unless you’re setting up complicated tests. I think Airbnb had an article about their testing setup and how A/A tests allowed them to spot a confounding issue with their split testing framework.
I was interested in hearing your thoughts on how this relates to machine learning, since you introduced it earlier as one of your topics. However, that topic never came to fruition (or at least not as of 2015_07_4_1746).
Nonetheless, thank you for the discussion!
Is there a minimum amount of traffic to qualify for doing any split testing, or is it best to hit the ground running? As in as soon as the site is live and see what happens?
I think this is answered in this post https://cxl.com/statistical-significance-does-not-equal-validity/
@alamat – A great deal is already on cxl.com!
@michaelpacer – Thanks – I guess the problem there is that you’re running several variants (say 5) that are dynamically altered in terms of the traffic that gets exposed to them. Spotting bias there is, for me, both a data collection problem and a matter of doing full QA on all the ‘experiences’ across devices/browsers. So an A/A test might be useful to calibrate the collection of data for analysis – but it won’t cover you if the variants you push into a machine learning algorithm are broken.
@MikeCRO – Yes you can run two tests, but that makes the sample smaller (as you can’t expose someone to both an A/A and an A/B at the same time) – so you could run both but then the test will take longer.
@Witold – Have you asked them about this? It’s probably beyond my knowledge to answer you here, so could you ask the guys who made the tool? Your point about sample sizes is good – I sometimes ask people to tell me their favourite colour in meetings and use this to illustrate how small changes to the numbers affect the percentages.
@daveW – you made the point that I should have made – that if you add the data together, it works. I assumed you were analysing these as separate samples for comparison.
@MikeCRO If you run an A/A test while running another test, the A/A test won’t hurt your other test. But your A/A test will no longer be A/A. It will mirror that other test.
What A/A/B/B does is indirectly control false positive risk (it places additional conditions on the subsamples, which can only happen at a lower alpha). The problem is it reduces false positive risk too much and inflates false negative rate. The same applies to A/A/B tests. Lowering alpha directly when planning the experiment is a better way of reducing false positive risk. I have some simulated examples on my blog.
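A toy simulation of one reading of that decision rule (declare a winner only when both subsample comparisons are independently significant at α = 0.05; all traffic numbers are illustrative) shows both effects – the false positive rate collapses well below 5%, and power collapses with it:

```python
import math
import random

random.seed(1)

def significant(c1, n1, c2, n2, z_crit=1.96):
    """Two-proportion z-test, two-sided, at alpha = 0.05."""
    p = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return se > 0 and abs(c1 / n1 - c2 / n2) / se > z_crit

def simulate(p_a, p_b, n=1000, runs=1000):
    """Share of runs declared significant by the pooled A/B test versus
    the 'both halves must be significant' A/A/B/B rule."""
    pooled = both = 0
    for _ in range(runs):
        a1 = sum(random.random() < p_a for _ in range(n))
        a2 = sum(random.random() < p_a for _ in range(n))
        b1 = sum(random.random() < p_b for _ in range(n))
        b2 = sum(random.random() < p_b for _ in range(n))
        pooled += significant(a1 + a2, 2 * n, b1 + b2, 2 * n)
        both += significant(a1, n, b1, n) and significant(a2, n, b2, n)
    return pooled / runs, both / runs

fp_pooled, fp_both = simulate(0.05, 0.05)    # no real lift: false positives
pw_pooled, pw_both = simulate(0.05, 0.065)   # real lift: power
print(f"false positives: pooled={fp_pooled:.3f}, both-halves={fp_both:.3f}")
print(f"power:           pooled={pw_pooled:.3f}, both-halves={pw_both:.3f}")
```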
“The other problem is that nearly 80% of A/A tests will reach significance at some point. ”
I don’t think that’s actually possible, if you really have no difference between your two tests.
This is simply because your test for statistical significance is designed explicitly so that if your two groups are identical, you will see a statistically significant difference 5% of the time.
Oh it’s possible and true. From a study by booking.com guys (http://blog.booking.com/is-your-ab-testing-effort-just-chasing-statistical-ghosts.html):
They ran 1,000 controlled experiments where it’s known that there is no difference between the variants.
771 experiments out of 1,000 reached 90% significance at some point
531 experiments out of 1,000 reached 95% significance at some point
Try this simulator http://destack.home.xs4all.nl/projects/significance/
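The “statistical ghosts” effect is easy to reproduce: check a plain two-proportion z-test repeatedly as an A/A test accumulates data, and count how often it ever crosses 95% significance. A toy simulation (all numbers illustrative):

```python
import math
import random

random.seed(7)

P = 0.03        # identical true conversion rate for both arms
BATCH = 200     # visitors added to each arm between looks at the data
PEEKS = 100     # how many times the test is checked before it "ends"
RUNS = 300      # simulated A/A experiments

def reaches_significance():
    """True if this A/A test EVER crosses |z| > 1.96 at any peek."""
    ca = cb = n = 0
    for _ in range(PEEKS):
        ca += sum(random.random() < P for _ in range(BATCH))
        cb += sum(random.random() < P for _ in range(BATCH))
        n += BATCH
        p = (ca + cb) / (2 * n)
        se = math.sqrt(p * (1 - p) * 2 / n)
        if se > 0 and abs(ca - cb) / n / se > 1.96:
            return True  # a "winner" between two identical pages
    return False

ghosts = sum(reaches_significance() for _ in range(RUNS))
print(f"{ghosts}/{RUNS} A/A tests ({ghosts / RUNS:.0%}) hit 95% significance "
      "at some point, far above the nominal 5%")
```

The nominal 5% only holds if you look once, at a pre-agreed sample size; peeking after every batch is what manufactures the ghosts.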