Multivariate Testing – Preventing Type I & Type II Errors


Summary:

The Author is a former Partner Technology Manager, Sales Engineer and Account Manager at Google and YouTube. He is a Director & Co-Founder at Adottimo, a specialist programmatic agency based in London. He introduces Multivariate Testing as a practice of publisher revenue optimization and Conversion Rate Optimization (CRO) using the Scientific Method. He presents Multivariate testing and highlights issues which can arise from improperly constructed Multivariate tests (i.e. The Multiple Testing Problem.)

Introduction

When it comes to verifying phenomena, a theory is accepted by Scientific consensus only if it adheres to the principles of The Scientific Method, which is the standard method of deducing Objective Truth. This method of rational, verifiable, and reproducible deduction, characterized by skeptical observation and rigorous testing – rather than conformity and bias – has for several hundred years distinguished science from pseudoscience or conjecture.

Objective Truth is True, regardless of personal belief or disbelief. The methods and tools of Science are uniquely conceived to seek out and establish Objective Truth.

In this article we will examine Multivariate Testing through the lens of Objective Truth (i.e. by invoking the Scientific Method).

No one Scientific Research result is “Truth” until it is verified by other scientific results, using a different experimental method. When we have the same objective results emerging from different Scientific experiments – then we have Objective Truth! A significant characteristic of Objective Truths is that they are not later shown to be false.

Scientific testing is about using data to challenge assumptions, test new ideas and arrive at Objective Truth.

 

As mentioned in a previous article (The Importance of A/B Testing In Publisher Revenue Optimization [1]), the Scientific Method is and will continue to be the Gold Standard methodology for empirical knowledge acquisition for the Sciences, as has been the case for millennia.

Multivariate Testing

Multivariate testing or multinomial testing is similar to A/B testing, but may test more than two versions at the same time or use more controls. Simple A/B tests by themselves are not valid for more complex phenomena, such as observational, quasi-experimental or other non-experimental situations – as is common with survey data, offline data, and other more complex phenomena.[2]

Note that certain sorts of analysis involving Multivariate Data, such as Multiple Linear Regression and Simple Linear Regression, are not considered special cases of Multivariate testing as the analysis is of multiple variables toward the outcome of a single (univariate) variable.

You May be Running More Tests Than You Think

Suppose you are designing a website, you want to test 5 different elements per page (header, title, call-to-action buttons, etc.), and you are running a CRO (Conversion Rate Optimization) test.

For each of the 5 elements you will have 4 different variations (e.g. 4 different titles, 4 different call-to-action buttons, etc.). Simple math will tell you that you will have 4^5 (or 1024) different scenarios to test!
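The count can be checked by enumerating the combinations directly. The following is a minimal sketch; the element names are placeholders chosen for illustration and are not a recommended set of page elements.

```python
from itertools import product

# Hypothetical element names, for illustration only.
elements = ["header", "title", "cta_button", "hero_image", "footer"]
variations_per_element = 4

# Each scenario picks one variation index (0 to 3) for every element.
scenarios = list(product(range(variations_per_element), repeat=len(elements)))

print(len(scenarios))                            # 1024
print(variations_per_element ** len(elements))   # 1024
```

Adding a sixth element with 4 variations would multiply the count by 4 again, which is why the number of scenarios grows so quickly.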

Here statistics gives us a clear warning: the more variations (comparisons) you make, the higher the probability that you obtain a false significant result. As a fundamental rule for any scientific experiment, you will need to establish a control for each creative, e.g. a control header, a control call-to-action button, etc.

The goal of the Multivariate test is simple: ascertain which content or creative variation produces the best improvement in the defined goals of a website, whether that be user registrations or conversion rates.

As a general rule, you will need your tests to be statistically significant. A result has statistical significance when it is very unlikely to have occurred given the Null Hypothesis [3]. Here, the Null Hypothesis is a general statement or default position that there is no relationship between two measured phenomena (i.e. the Conversion Rate is unrelated to the changes in either of the different variations) [4].

Suppose, for simplicity, that you need 400 conversions per scenario in order to ensure that the data you are collecting is statistically significant.

This translates to 1024 (variations) * 400 conversions per variation = 409,600 conversions.

If your website’s average conversion rate is 1% (which is ordinary), then you will need 100 * 409,600 = 40,960,000 visits (roughly 41 million) in order to gain confidence in your results!
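The arithmetic above can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope estimate: the figure of 400 conversions per scenario is the simplifying assumption used in the text, not a derived sample size, and the variable names are illustrative.

```python
scenarios = 4 ** 5               # 5 elements, 4 variations each
conversions_per_scenario = 400   # simplifying assumption from the text
conversion_rate = 0.01           # 1% average conversion rate

total_conversions = scenarios * conversions_per_scenario
total_visits = round(total_conversions / conversion_rate)

print(f"{total_conversions:,} conversions")   # 409,600 conversions
print(f"{total_visits:,} visits")             # 40,960,000 visits
```

The exact product is 409,600 conversions, or roughly 41 million visits at a 1% conversion rate.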

If testing 1024 variations based on a simple assortment of 5 elements (header, call-to-action buttons, etc.) sounds difficult, imagine how much more complicated things get when you start adding variations in campaigns, products, offers, keywords! However, for some large companies these kinds of tests are not just used frequently; they are the normal way of determining optimal UX designs.

Companies like Amazon, Google, Netflix, Facebook and others employ multivariate testing extensively.

Sample Multivariate Test from Amazon – Variation 1

 

Sample Multivariate Test from Amazon – Variation 2

Some Fundamental Statistical Hypothesis Testing Terms

In statistics, the probability of Type II Error (false negative) is denoted by beta (β). Increasing the sample size of your test usually reduces the probability of a Type II Error (it increases the statistical power, 1 - β).

Alpha (α) denotes the probability of Type I Error (false positive). You typically construct your test with a significance level of 5% to limit the possibility of Type I errors.

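The 5% significance level means that if there is truly no difference between the control and the variation (the null hypothesis is true), you have only a 5% chance of wrongly declaring a winner (rejecting the null hypothesis). This is commonly described as a significant difference between the control and the variation at 95% “confidence.” Note that it is not the same as a 95% chance that a declared winner is genuinely better.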
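One way to build intuition for α is to simulate A/A tests, in which both arms share the same true conversion rate, so every "significant" result is by construction a Type I Error. The sketch below is an illustration under stated assumptions: it uses a pooled two-proportion z-test, and the function names, the 10% conversion rate, the sample sizes and the seed are all hypothetical choices.

```python
import math
import random

def p_value_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_aa_tests(n_tests=500, visitors=1000, true_rate=0.10,
                      alpha=0.05, seed=42):
    """Fraction of A/A tests (no real difference) declared significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        if p_value_two_sided(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / n_tests

rate = simulate_aa_tests()
print(f"Observed false positive rate: {rate:.3f}")
```

With α = 0.05 the observed rate should land in the neighbourhood of 5%, even though no variation is actually better than the other.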

Multiple Testing Issues

Let’s begin with some elementary definitions:

When conducting a two-tailed test that compares the conversion rate of the control (ρ1) with the conversion rate of the variation (ρ2), your hypotheses would be:

Null hypothesis:  Η0: ρ1 = ρ2

Alternative hypothesis: Η1: ρ1 ≠ ρ2

Type I Error – Rejecting the null hypothesis (Η0) when it is true.

Type II Error – Rejecting the alternative hypothesis (Η1) when it is true. In other words, failing to reject the null hypothesis (Η0) when it is false.
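One common way to test Η0: ρ1 = ρ2 is a pooled two-proportion z-test. The sketch below is a minimal illustration under that assumption; the function name and the visitor and conversion counts are hypothetical and are not taken from any real experiment.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-tailed p-value) for H0: rho1 == rho2."""
    p_a = conv_a / n_a                        # control conversion rate (rho1)
    p_b = conv_b / n_b                        # variation conversion rate (rho2)
    pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate assumed under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical data: 40,000 visitors per arm.
z, p = two_proportion_z_test(conv_a=400, n_a=40_000, conv_b=460, n_b=40_000)
alpha = 0.05
print(f"z = {z:.2f}, p = {p:.4f}, reject H0: {p < alpha}")
```

Rejecting Η0 here when the two true rates are in fact equal would be a Type I Error; failing to reject it when they genuinely differ would be a Type II Error.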

 

To describe this more simply, a Type I Error is to falsely infer the existence of something that is not there (e.g. raising a false alarm with no danger in sight or put in other terms “crying wolf” with no wolf in sight.)  

Depiction of Type I Error (False Positive) – False alarm – just a Sheep in a Wolf’s clothing

A Type II Error is to falsely infer the absence of something that is there (e.g. failing to correctly raise an alarm in the presence of danger, i.e. failing to cry “wolf!” when a wolf is present.)

Depiction of Type II Error (False Negative) – There’s a Wolf among the Sheep

If the significance level for a single test is α, the experiment-wise probability of at least one Type I Error rises rapidly towards 1 as the number of tests increases (so the overall result becomes less trustworthy). More precisely, assuming all tests are independent, if n tests are performed, the experiment-wise Type I Error rate is given by [6]:

P(TypeIError)=1-(1-α)^n

where n is the number of variations in a test. So for a test that contains 10 different variations and a significance level of 5% (n = 10 and α = 0.05), the overall Type I Error rate increases to:

1-(1-0.05)^10 ≈ 0.40, or 40%!

In other words, your chance of at least one false positive is not far from that of a coin toss!

The Probability Rises Rapidly, Reaching 90% at About 45 Variations

Probability of Obtaining at Least One Type I Error with n Different Variations, alpha = 0.05

Note that by 135 variations, the probability of obtaining at least one Type I Error is 99.9% (using a 5% significance level). With the CRO example stated previously (with 1024 variations) the probability of obtaining at least one Type I Error is effectively 1 (100%)!
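The figures quoted in this section follow directly from the formula 1-(1-α)^n. A minimal sketch, assuming independent tests as the formula does (the function name is illustrative):

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I Error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 45, 90, 135, 1024):
    rate = familywise_error_rate(0.05, n)
    print(f"n = {n:>4}: {rate:.4f}")
```

For n = 10 this gives about 0.40, it crosses 0.90 at n = 45, and for n = 1024 it is indistinguishable from 1 at this precision.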

Solutions to the Multiple Testing Problem

Some of the simplest solutions, which work for a limited number of comparisons, are statistical methods of correction for multiple testing.

These include the Bonferroni correction, which counteracts the problems arising from multiple comparisons by testing each individual comparison at a stricter significance level (α divided by the number of comparisons), or equivalently by adjusting the p-values, to prevent Type I errors. The Benjamini–Hochberg procedure (BH step-up procedure) instead controls the False Discovery Rate (FDR) at level α. Both rely on statistical adjustments with the goal of reducing the chances of obtaining false-positive results. These methods are quite technical, so we won’t elaborate on the formulas used to derive them.
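Although the derivations are out of scope here, applying the two procedures is short enough to sketch. The code below is a plain-Python illustration, not a substitute for a tested statistics library; the function names are our own and the list of p-values is made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for a test only if its p-value is <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k / m) * alpha,
    then reject the k hypotheses with the smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from 10 variations, each compared with the control.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

print(sum(p <= 0.05 for p in p_values))     # 5 "winners" with no correction
print(sum(bonferroni(p_values)))            # 1 winner after Bonferroni
print(sum(benjamini_hochberg(p_values)))    # 2 winners after Benjamini-Hochberg
```

Bonferroni is the more conservative of the two, which is why it keeps fewer winners here; Benjamini–Hochberg tolerates a controlled share of false discoveries in exchange for more power.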

A far simpler approach is to reduce the number of variations being tested to a level at which a statistically significant result can be obtained from the available sample size. As discussed earlier, companies of the scale of Google, Amazon, Netflix and Facebook are able to run experiments using enormous sample sizes, so they are able to conduct multivariate testing using statistically significant samples.

Use of Multivariate Models

Statistics is a very advanced field. It has existed for hundreds of years, and the problems we encounter today have been similarly faced by some of the greatest minds of antiquity.

There are at least 18 different well-established multivariate analysis models in the field of Statistics that are frequently used today, each with its own method of analysis [7].

As an example, Multivariate Analysis of Variance (MANOVA) extends the analysis of variance to cover cases where there is more than one dependent variable to be analyzed simultaneously. It establishes the relationships between dependent and independent variables. This is an extension of Analysis of Variance (ANOVA), which is used for univariate analysis. Another similar model for multivariate analysis is Multivariate Analysis of Covariance (MANCOVA). An advantage of the MANCOVA design over the simpler MANOVA model is the ‘factoring out’ of noise or error that has been introduced by the covariate [7].

Closing Remarks on Multivariate Testing

  1. Introduce variations which are more radical departures from the control. This limits the number of variations and the likelihood of running into the multiple testing problems mentioned above.
  2. Don’t change too many factors at the same time. The more variations you add to your test the higher your chances of obtaining both Type I and Type II Errors. As seen previously, the possibility of obtaining a Type I Error increases exponentially as the number of variations tested increases. By the 45th variation, there is a 90% chance of obtaining at least one Type I Error (using an alpha value of 0.05.) This increases to a 99% chance of obtaining at least one Type I Error with 90 variations.
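  3. Increasing the sample size is always a good strategy: it reduces Type II Errors directly, and it makes it feasible to apply a stricter significance level (or a multiple testing correction) to reduce Type I Errors without losing power.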
  4. Keep it Simple. There is no tangible benefit in unnecessarily increasing the number of variations tested unless you are doing so in accordance with Statistically sound theory. Otherwise you’ll obtain misleading results which are not considered Statistically sound or Statistically significant. As mentioned, statistics is a very developed field and there are several means of performing scientifically sound multivariate analysis tests.
  5. Formulate the correct hypothesis. Whether rejection of the null hypothesis truly justifies acceptance of the research hypothesis depends on the structure of the hypotheses. Rejecting the hypothesis that a large paw print originated from a bear does not immediately prove the existence of Bigfoot! Hypothesis testing emphasizes the rejection, which is based on a probability, rather than the acceptance, which requires extra steps of logic.

References

[1] The Importance of A/B Testing In Publisher Revenue Optimization, Adottimo, 2018.

[2] AB Testing, Wikipedia Online Encyclopedia

[3] Sirkin, R. Mark (2005). “Two-sample t tests”. Statistics for the Social Sciences (3rd ed.). Thousand Oaks, CA: SAGE Publications, Inc. pp. 271–316. ISBN 1-412-90546-X.

[4] Everitt, Brian (1998). The Cambridge Dictionary of Statistics. Cambridge University Press, UK. ISBN 0521593468.

[5] Peter Heinrich, A/B Testing Case Study: Air Patriots and the Results That Surprised Us, Amazon Developer Portal, Appstore Blogs.

[6] Multiple Hypothesis Testing and False Discovery Rate, University of California, Berkeley, Department of Statistics.

[7] Multivariate Statistics, Wikipedia Online Encyclopedia

The Importance of A/B Testing in Publisher Revenue Optimisation

Summary:
The Author is a former Partner Technology Manager, Sales Engineer and Account Manager at Google and YouTube. He is a Director & Co-Founder at Adottimo, a specialist programmatic agency based in London. He presents A/B testing as a practice of publisher revenue optimization, which brings the field more in line with a broader evidence-based practice approach. He presents a few examples of why A/B testing is important in publisher revenue optimization, using case studies from companies such as Google and Netflix.

Introduction

I come from an Engineering (Electronic/Electrical), IT, Telecom and FinTech background, so my approach to publisher optimization tends to be scientific, specifically based on the “Scientific Method” – the most prevalent form of evidence based empirical testing used in the natural sciences. The Scientific Method has withstood the test of time, remained The Gold Standard empirical method of knowledge acquisition, and led the rapid development of natural science (in every natural scientific field) since at least the 17th century.

 

In online Publisher Revenue Optimization, many assumptions are often made on the basis of preferences, comparisons with the traditional print and TV media industries, etc. Although those with sufficient experience are able to deduce specific themes and general best practice, A/B testing is most effective when it is performed continuously, and the companies cited in this article use A/B testing as an ongoing process.

 

Publisher Revenue optimization has very simple goals, such as: “Which are the best performing Ad Unit sizes for this viewport size?”, “Do Responsive ad units actually increase my earnings?”, “Does Dynamic Allocation actually increase my revenue yield?” or some slightly more complex ones, such as “What is the Optimal Min CPM for this ad unit in AdX?”. However, the questions are never ambiguous, since the motive is usually simple: “Does this make me more money!?” Luckily enough, the answer to that last question is always binary.

 

This article is an introductory guide for those with limited experience in the field, or even the more initiated who are not yet convinced of the value of A/B testing. It provides some context using examples for why this type of empirical data acquisition/correlation inference – which in the context of publisher revenue optimisation tends to mostly be A/B testing – is important. This article is intended for a broad audience: whether you’re receiving under a few million pageviews/month or hundreds of millions, technical or non-technical, business focused or not, the case studies apply regardless.

Modern Era – Why Should we Care?

Case Study 1: Google Search Ads


I’ll begin with a simple anecdotal and well known story involving Google’s former outspoken VP for Search, Marissa Mayer (employee Number 20 and Google’s first female engineer)[4] who later became President & CEO of Yahoo!

 

Marissa Mayer, then Vice President of Google Search Products and User Experience (the former “gatekeeper” of the Google Homepage), famously led a Google test which generated $200M in revenue per year for Google using simple A/B testing only [4]

 

This test involved testing 41 different shades of blue advertising links in Gmail and Google search – and measuring which links and variations of different shades of blue generated the highest Click Through Rates (CTRs), and hence, revenue.

 

As trivial as color choices might seem, clicks are a key part of Google’s revenue stream, and anything that enhances clicks means more money.

 

Google set the color of the blue links according to the then highest CTR variation and were then able to generate an extra $200m a year in revenue from this single A/B testing experiment only!

According to a statement in The Guardian by then Google UK MD, Dan Cobley[4]:

“As a result we learned that a slightly purpler shade of blue was more conducive to clicking than a slightly greener shade of blue, and gee whizz, we made a decision.”

When measuring empirically verifiable phenomena, the facts cannot be disputed. A scientifically sound and statistically significant A/B test, such as the one in the previous example on which link color produces a higher CTR, is extremely difficult to disprove. Each of the 40 variations was tested on a sample of 2.5% of Google’s user base – millions of people per variation!

 

Context on Google’s Advertising Scale:

Alphabet (Google’s Parent company) is the largest “media company” in the World (by Advertiser Revenue), with over €81.97 Billion of earnings in 2017 from Media spend, ahead of 2nd place Comcast (€72.64 Billion), and 3rd place The Walt Disney Company (€50.26 Billion.)[9]

A/B Testing

An A/B Test is a controlled experiment which tests two variants (variant A vs variant B) to measure the change (difference/delta) in the value of a metric (or variable). This metric may be (for instance in the Online Publishing industry): CTR, Pageviews, Sessions per user, Bounce Rate, etc. One of the variants (say variant A) is used as a control.

 

In the case of the Google example cited earlier, the metric tested was the change in CTRs and the 40 different variants tested (in separate tests) were the different hues of blue advertising links. This is a form of statistical hypothesis testing or “two-sample hypothesis testing” as used in the field of statistics [5]. Experimenters can utilize the Scientific Method to form a hypothesis of the sort “If a specific change is introduced, will it improve key metrics?”, evaluate their test with real users, and obtain empirically verifiable results.

A 2017 Harvard Business Review article titled “The Surprising Power of Online Experiments” stated that the leading IT companies (Google, Amazon, Booking.com, Facebook, and Microsoft) each conduct more than 10,000 online controlled experiments annually, with many tests engaging millions of users [5]. Having worked within Google, I can say that their experiments can involve up to billions of users! The article goes on to mention that start-ups and large companies without digital roots, such as Walmart, Hertz, and Singapore Airlines, also run them regularly, though on a smaller scale. These organizations have discovered that an “experiment with everything” approach has surprisingly large payoffs [5].

As another example, in 2000 Google used an A/B test to ascertain the optimal number of Search Engine Results to show per page. A/B tests have become vastly more complex since those early days, with multivariable/multinomial testing becoming the norm.

Case Study 2: Netflix

In a popular case cited by DZone, 46% of respondents surveyed by Netflix said that being able to browse the titles of Netflix content was the one thing they wanted before signing up for the service. So, Netflix decided to run an A/B test on their registration process to see if a redesigned registration process would help increase subscriptions.

Netflix Survey

Netflix created a new design which displayed movie titles to visitors before registration, as users had requested. The Netflix team wanted to find out if the new design with movie titles would generate more registrations compared to the original design without the titles. This was analyzed by running an A/B test of the new designs against the original design.


Hypothesis:

The Test’s Hypothesis was straightforward: Allowing visitors to view available movie titles before registering will increase the number of new signups.

In the A/B test the team introduced 5 different variants against the original design. The team then ran the test to see the impact.

 

Results:

The original design consistently beat all challengers. Completely contrary to what 46% of visitors were requesting!

So, why did the original design beat all new designs although 46% of visitors said that seeing what titles Netflix carries will persuade them to sign up for the service?

We know that demonstrating value early is a fundamental principle of the onboarding process. So how is it that one of the most data driven companies in Silicon Valley doesn’t do this?

 

Conclusions:

Netflix attributes the success of the original design vs the variants as follows:

1) “Don’t confuse the meal with the menu.”

Anna Blaylock, a Product Designer at Netflix, wondered this too when she first started at the company. “Netflix is all about the experience,” she says. Just as a restaurant dining experience isn’t solely about the food on the menu, Netflix’s experience isn’t just about the titles [8].

2) Simplify Choices

Each of the variants built for the tests increased the number of choices and possible paths for Netflix visitors to follow. The original only had one choice: Start Your Free Month. Blaylock says that the low barrier to entry outweighs the need for users to see all their content before signing up.

3) Users Don’t Always Know What They Want


While the survey had shown that 46% of users wanted to browse titles before signing up for Netflix, the tests showed otherwise. Testing reveals our assumptions. That’s why it’s so important to run tests and trust the data.

Anna Blaylock left her interviewers with a good quote:

“Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won’t come in” Isaac Asimov

 

Case Study 3: Microsoft – Bing

The Harvard Business Review article cited earlier states that about 80% of proposed changes at Bing are first run as controlled experiments [5].

It highlights as a first case that in 2012 a Microsoft employee working on Bing had an idea about changing the way the search engine displayed ad headlines. Developing it wouldn’t require much effort (just a few days of an engineer’s time), but at a company of that scale it was one of hundreds of ideas proposed.

At the time, the program managers deemed it a low priority. So it languished for more than six months, until an engineer, who saw that the cost of writing the code for it would be small, launched a simple A/B test to assess its impact. Within hours the new headline variation was producing abnormally high revenue, triggering a “too good to be true” alert.

These alerts are usually used in Companies of the scale of Microsoft for detecting when something goes terribly wrong (or very right) – it’s a form of anomaly detection. Usually, such alerts signal a bug, but not in this case. An analysis showed that the change had increased revenue by an appreciable 12%—which on an annual basis would come to more than $100 million in the United States alone —without hurting key user-experience metrics! It was the best revenue-generating idea in Bing’s history, but until the test its value was underappreciated [5].

The article mentions that “Microsoft’s Analysis & Experimentation team consists of more than 80 people who on any given day help run hundreds of online controlled experiments on various products, including Bing, Cortana, Exchange, MSN, Office, Skype, Windows, and Xbox.”[5]

Multivariate Testing

Multivariate testing or multinomial testing is similar to A/B testing, but may test more than two versions at the same time or use more controls. Simple A/B tests are not valid for observational, quasi-experimental or other non-experimental situations, as is common with survey data, and other, more complex phenomena [7]. Such cases are more likely to draw “incorrect results” using seemingly “sound math”, but the wrong formulas!

Note that simple linear regression and multiple regression are not usually considered to be special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables. Multivariate statistics already has more than 18 well-established models, each with its own analysis.

Closing Remarks on A/B testing

I was fortunate to have personally met Marissa Mayer (then Vice President of Google Search Products and User Experience). I’ve learned a lot of lessons from her legacy, worked with some of her Product Managers at Google, worked with publishers of all sizes and strategic importance in Google’s Publisher team, and learned first-hand and on the job from a number of really top-notch coaches, some of the best in the world at Publisher Optimization.

 

The cases presented are lessons that serve as a non-technical overview to A/B testing, which is part of a larger publisher growth strategy. We understand that every business is unique and has different goals and objectives, skills and resources, budgets and audience.

 

Google’s Parent Company Alphabet is now, and has been for at least 2 years, the largest “media” company in the World (by Advertiser Revenue), with over €81.97 Billion of earnings in 2017 from Media spend, ahead of 2nd place Comcast (€72.64 Billion), and 3rd place The Walt Disney Company (€50.26 Billion). A breakdown of the earnings per company is available in the Statista report [9].

As more and more advertising budgets are moving online it is increasingly important (especially for traditionally offline publishers) to focus on growing your online publisher revenues. However, it requires more sophisticated approaches (Programmatic RTB, Dynamic Allocation, Preferred Deals, Private Auctions, etc.) in addition to just A/B and Multivariate testing.

If you have read this post so far and are in the publishing business, we would love to hear from you. Please feel free to visit our website to contact us for a preliminary consultation.

About the Author: 

Tangus Koech

Tangus is a former Partner Technology Manager, Sales Engineer and Account Manager at Google and YouTube. He is a Director & Co-Founder at Adottimo, a specialist programmatic agency based in London.

During his last year in Google he managed very strategic media technology partners which contributed additional revenue of over US$ 700 Million in YouTube Advertising revenue annually, based in the US, United Kingdom, Germany, Japan, France and Netherlands – amongst Google’s highest revenue generating countries.

Tangus earned his stars as a monetisation expert during nearly 7 years at Google, working with publishers in both Developed and Emerging Markets. He made amongst the first Million-Dollar publishers in the African continent.

He is a Programmatic Expert in Google AdSense, DFP and AdX. Within Alphabet GOOG (NASDAQ), Google’s Parent Company, Tangus’ team – the Partner Solutions Organization – was responsible for accounts which collectively generated approximately 90% of all Alphabet’s Google/YouTube publisher revenue.

Some of his partners include WPP Group plc (UK), Opera Software ASA (Norway), Vodafone Group (UK), Orange Group SA (France), Endemol B.V. (International), GfK (Germany & Benelux), Intage (Japan). In Emerging Markets he has managed accounts for, among others, Comcast NBCUniversal (International), DStv (South Africa), eNews Channel Africa (eNCA, South Africa), Nation Media Group (Kenya), Standard Media Group (Kenya.)

 

References:

[1] “Ancient Egyptian Medicine”, Wikipedia Online Encyclopedia

[2] “Edwin Smith papyrus (Egyptian medical book)”. Encyclopedia Britannica (Online ed.). Retrieved 1 January 2016.

[3] “Edwin Smith papyrus”, Wikipedia Online Encyclopedia

[4] “Why Google has 200m reasons to put engineers over designers”, The Guardian News and Media Limited, 5th February 2014.

[5] “The Surprising Power of Online Experiments”, Harvard Business Review, September-October 2017 Edition.

[6] “Encyclopedia of Machine Learning and Data Mining” (PDF), Kohavi, Ron; Longbotham, Roger (2017). “Online Controlled Experiments and A/B Tests”. In Sammut, Claude; Webb, Geoff. Springer.

[7] “A/B Testing”, Wikipedia Online Encyclopedia

[8] “The Registration Test Results Netflix Never Expected”, DZone, Kendrick Wang, January 4th 2016

[9] “Premium Leading media companies in 2017, based on revenue (in billion euros)”, Statista