Some Testing is a Waste of Time: Making Business Cases for Big Bets

The gospel of testing has spread far and wide, and today you'd be hard pressed to find a company that isn't actively experimenting in some form.

Many of the world’s most valuable and fast-growing companies — Google, Uber, Amazon — credit their success to running thousands of experiments. 

Yet in private, most marketers are worried about the returns they’re seeing from their testing programs.

To highlight just one recent example from our DMs: “[senior management] expect us to find huge wins within every test, when the vast majority of them are inconclusive.” Even when you do achieve hard-won victories after months of grinding (+20% here, +10% there), the inevitable question from the CEO arrives: “Why can’t I see it in the bottom line?”

Here’s a secret: Marketers aren’t getting the big returns they expect from testing, because they don’t make big enough bets.

Big experiments require strong business cases and coordination across multiple stakeholders. The path of least resistance results in low-risk/low-reward tests like button colors, imagery, or other inconsequential things.

In this blog post we’re going to make the case for why and how big experiments can move your business forward. We’ll make our case using an example of a big, risky bet: turning off your highest-performing channel.

How do you know if your big bet will pay off? We’ll provide a simple framework to calculate the expected value of the information gained from the experiment. We’ll explain how to adapt the framework to your own use cases, and take a peek under the hood for those who want to dive deeper. Finally, we’ll cover the contrarian insight that you should take bigger bets in times of uncertainty.


About the Authors

Michael is a trained econometrician with a background in healthcare and environmental economics. He previously built the marketing science team at men’s grooming brand Harry’s before co-founding Recast.

Tom holds a master’s degree from the London School of Economics and an MBA from Wharton. Prior to co-founding Recast he built a quantitative market research firm consulting for clients such as Amazon, McKinsey and Nike.

Mike was a co-founder at Ladder, a 50-person growth marketing agency, leading the ad operations, data science, and product teams. He is currently building simulator-based attribution courses at Vexpower.



Why Most Testing is a Waste of Time

Most marketers have become more data-driven over the last decade. We’ve driven business growth and have seen rapid career progression as a result. 

The current generation of marketers learned their craft during the rise of digital marketing, where tales abound of Google testing 41 shades of blue, Uber running over 1,000 experiments at any given time, and Amazon’s culture of tolerating failed experiments as a necessity for innovation. The effort to apply the scientific method to marketing comes from a good place, but it can also be used as a crutch.

These days, modern marketers diligently A/B test every landing page, try multiple variations of each ad creative, and never send an email without at least 2 alternate subject lines. And it’s exhausting. It’s common to see 7 to 10 failures before finding a winning test variation, which can kill motivation for all but the most stats-hardened team members. 

Those that have done a lot of testing eventually realize that there are diminishing returns to experimentation. 

For companies following industry best practices, huge outcomes from testing are rare, and they get rarer the more tests you run. There’s a significant political and financial cost to every test, so below a certain line many things just aren’t worth testing. 

From his time agency-side, Michael Taylor recalls a retargeting pitch his team made for a mid-sized B2B client:

“Clients are usually hesitant to test retargeting ads, but most will relent if we prove it’s driving a good return on investment. In my team’s impressive and comprehensive plan they had broken out six separate funnel stages, each with their own tailored creative, using holdout groups within custom audiences matched by email to measure the campaign’s incrementality. There was just one problem: when we ran the numbers, only 2 users a month would even see the final stage of the funnel! The test just wasn’t worth running, at least until we had 1000x more traffic to retarget.”

Marketers should invest in better measurement and attribution, but the really big step-changes in performance have a habit of making themselves known. When something big pays off, you’ll have no doubt that it’s working. You’ll see a spike in the numbers, you’ll hear glowing feedback from users, the CEO will pat you on the back (or give you a :thumbsup: emoji on Slack). When you have a big win you can put away the statistical significance calculator.

As Ernest Rutherford said: “If your experiment needs statistics, you ought to have done a better experiment.”

Why Marketers Don’t Make Bigger Bets

The problem is clear: marketers don’t take big enough swings in their marketing experiments. We’ve seen this again and again working with marketers across different verticals and a variety of company sizes. They’re so paralyzed by a fear of rocking the boat and potentially missing their goal for the quarter that they’re unable to make the types of big bets that could be transformational for their businesses.

Much of this paralysis is due to a lack of a framework for determining the value of a test. This leads marketers to under-invest in campaigns with higher (but uncertain) payoffs in the long term, and over-invest in small short-term wins. Too many people think that if they spend money on a test and it doesn’t work out, those dollars were wasted. That’s wrong: those dollars bought incredibly valuable learnings for the business. In reality, many “failed tests” create enormous amounts of value for the organization.

“Outsized returns often come from betting against conventional wisdom, and conventional wisdom is usually right. Given a ten percent chance of a 100 times payoff, you should take that bet every time. But you’re still going to be wrong nine times out of ten. We all know that if you swing for the fences, you’re going to strike out a lot, but you’re also going to hit some home runs.” – Jeff Bezos

In general, the tests that marketers are most comfortable with are simple to run and most likely to “succeed,” but they don’t actually create much informational value. They’re designed to generate small wins that move you toward a “local optimum” within the framework of what you’re already doing, and nothing more (a small improvement, not a step-change winner). Running a large volume of small tests makes you look productive even if it isn’t moving the needle, so the more company politics your team is exposed to, the more likely they are to take the safer bets.

Safe (but inconsequential) bets might include:

  • Button colors

  • Borders of ad images

  • Emojis in email subject lines

  • Below the fold copy on landing pages

  • Easy to measure paid channels

We’ve all done them. It’s smart to start with “quick wins”. These experiments usually don’t require too many layers of approval to run – in many organizations they require no permission at all – so naturally we move towards the path of least resistance. Unfortunately, as John Maynard Keynes said: “it is better for reputation to fail conventionally than to succeed unconventionally.”

Examples of Big Bets that Paid Off

We’re all pretty accustomed to smaller, incremental experimentation, so here are a few examples of bigger bets that actually paid off.

Big Win: Google’s Ad Auction Model

We mentioned Google’s famous test of 41 shades of blue earlier, but they weren’t afraid of making big bets either. CEO Eric Schmidt initially objected to the auction model for selling ads: “I was absolutely convinced that this would bankrupt the company”. Thankfully they tested it anyway. Today the ad auction generates $209.5 billion in revenue for Google, and it has been copied by every major digital ad platform.

Big Win: Organic Search for Groupon

How about Groupon intentionally de-indexing their website from Google for 6 hours, temporarily killing their SEO traffic? You might think they were mad, or that this was a big mistake, but they learned a valuable and non-obvious lesson: 60% of their “direct” traffic was actually organic search. Knowing this, their SEO managers were able to justify significantly more budget for their work, and the business could invest more aggressively in the channel.

Big Win: Turning Off Ad Spend

It’s one thing to turn off SEO for a few hours, but how about dialing down your media spend on a more permanent basis? It seems counterintuitive to willingly decrease your own budgets, but with CEOs and CFOs looking more closely at digital spend and asking whether it’s actually generating incremental performance, you need to be proactive. Companies like P&G, Chase, Uber, eBay, and Airbnb have saved hundreds of millions of dollars from their budgets after running “turn-off” experiments showing that much of their spend was wasted.

Case Study: Turning Off Your Best Channel

Let’s imagine we’re marketing in a channel – for the purposes of this walkthrough, a hypothetical hot new social platform called FlipFlop. According to the in-platform reports it appears to be incredibly high performing, but we’re unsure if it’s having a truly incremental impact on our business. In the post-iOS14 world you can’t rely on just one form of measurement, so you’ve built a marketing mix model, which shows FlipFlop ads aren’t as impactful as their in-platform reporting claims.

How do you validate your hypothesis that FlipFlop ads are performing worse than expected? Assuming the channel is fairly large, a pretty simple way to get signal on this would be to simply turn the channel off. We’ll go dark in the channel for a period of time and then see if our overall level of conversions drops.

This test feels very risky. If we turn the channel off we might lose conversions, and we’ll miss our target!

However, it’s important to carefully (and numerically) weigh the benefits of this test. Let’s imagine a few different scenarios:

  1. We go dark in the channel, see that conversions drop, get an estimate of incrementality, and turn it back on

  2. We go dark in the channel, see that conversions stay flat, and learn the channel isn’t driving incremental conversions for us

  3. Something in between 1 & 2

In scenario 2, the test is incredibly valuable because it means we can stop spending in that channel. The value of that learning is equal to the total amount of money we would otherwise continue to spend in that marketing channel.

In scenario 1 we lose a few sales from cutting our advertising budgets temporarily. So when we think about the value of the test, we need to compare the downside case (scenario 1) with the upside case (scenario 2). Here it seems like for a relatively small cost, we’ve got the potential for a huge upside.

People tend to underestimate the value of the learnings that come from a test like this, so it’s important to underscore: the costs of the test are temporary, but the benefits of the learning are very long-lasting. It’s easy to forget that the anticipated costs are only incurred during the course of the test.

Since those long-lasting benefits can be difficult to quantify, they tend to get ignored completely. But making better decisions about marketing spend over the course of a year or more can be hugely impactful to the bottom-line of a business.

Let’s quantify this a bit. Going in, we are very uncertain about the true performance of this channel because of the conflict between in-platform reporting and the media mix model results. Let’s say we believe that the true incremental ROI is somewhere between 0x and 5x. And let’s imagine that we currently spend $10k a week on this channel. Let’s also imagine that we believe that going dark in the channel for two weeks will tell us what we need to know.

Costs of the test:

  • First off, there’s the $20k we save (a negative cost) when we go dark in the channel for two weeks ($10k x 2)

  • Scenario 1: in the worst possible scenario, we lose out on $100k of revenue ($10k x 5 ROI x 2 weeks). Net of the $20k saved, this gives us a cost of $80k over the test period.

  • Scenario 2: in the best possible scenario, we save $20k (during the test) plus $520k in marketing dollars in the next year alone ($10k x 52 weeks), since we stop investing in a channel that doesn’t drive value for us.

So in scenario 1 we lose $80k, but in scenario 2 we gain $540k. This means that if we think there’s anything greater than a 15% chance that the ROI for this channel is truly 0, then this test is worth running!
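To make the arithmetic explicit, here’s a minimal sketch of the scenario math in Python. The inputs mirror the hypothetical FlipFlop example above (spend, ROI range, test length); swap in your own numbers.

```python
# Hypothetical FlipFlop go-dark test: quantify the downside and upside scenarios.
weekly_spend = 10_000   # current spend in the channel, $/week
test_weeks = 2          # length of the go-dark period
max_roi = 5.0           # upper bound of our belief about the true incremental ROI
weeks_per_year = 52     # horizon over which the learning keeps paying off

saved_during_test = weekly_spend * test_weeks                       # $20k we don't spend

# Scenario 1 (channel is as good as it could possibly be): we forgo revenue during the test.
lost_revenue = weekly_spend * max_roi * test_weeks                  # $100k
scenario_1_net = saved_during_test - lost_revenue                   # -$80k

# Scenario 2 (channel drives nothing incremental): we keep the savings and stop spending.
scenario_2_net = saved_during_test + weekly_spend * weeks_per_year  # +$540k

# Break-even chance that the channel's true ROI is 0, treating the
# scenario 1 loss as the price of admission (the same simplification used above).
break_even = -scenario_1_net / scenario_2_net
print(f"Scenario 1 net: ${scenario_1_net:,.0f}")
print(f"Scenario 2 net: ${scenario_2_net:,.0f}")
print(f"Worth running if P(true ROI = 0) > {break_even:.0%}")
```

Note that this treats the full $80k downside as certain rather than weighting it by its own probability; a stricter expected-value check (p x $540k > (1 − p) x $80k) puts the break-even closer to 13%, so the 15% figure is the more conservative threshold.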

The Big Bet Testing Calculator

While you can sometimes do these calculations on the back of a napkin, it helps to put it all on a spreadsheet so everyone can see (and contribute to) the numbers. If you’re here for mental models rather than spreadsheet models, you can skip this section. 

You can see these calculations in this template: Big Bet Calculator.

Feel free to make a copy of this template to use in calculating your own big bets. The process works like this:

1. Define the worst case scenario

Break the problem down into its component parts (in our case, spend and ROI) and calculate what they would be in the worst case scenario (here, an ROI of 0).

2. Estimate the best case scenario

What do you (or members of your team, consultants, etc.) expect to be realistically achievable? If you think you’ve got a 1 in 10 chance of hitting that number, you’re in the right ballpark.

3. Calculate the expected value

Adjust the “Chance of Min” (the probability of getting the worst case scenario) up and down until you find the point at which the expected value becomes positive, i.e. where the benefits of testing outweigh the costs.

By adjusting the “Chance of Min” (the chance that the worst case scenario, an ROI of 0, is true) up and down until we get to 15%, we can see that’s where the expected value becomes positive, since 0.15 * $540k > $80k. We don’t know the real odds, but we don’t need to: if we believe there’s better than a 15% chance that the channel’s true ROI is 0, we can be confident the expected value of the test makes it worthwhile to run.
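If you’d rather script it than nudge a spreadsheet cell, here’s a rough Python equivalent of step 3. The “Chance of Min” name simply echoes the template’s label, and the numbers are the FlipFlop figures from the case study.

```python
def expected_value(chance_of_min: float, upside: float, downside: float) -> float:
    """Probability-weighted upside minus the (treated-as-certain) downside,
    mirroring the simple 0.15 * $540k > $80k check in the template."""
    return chance_of_min * upside - downside

upside, downside = 540_000, 80_000  # FlipFlop case study figures

# Sweep "Chance of Min" (probability the channel's true ROI is at its minimum, i.e. 0)
# until the expected value of running the test turns positive.
for p in [i / 100 for i in range(0, 101, 5)]:
    if expected_value(p, upside, downside) > 0:
        print(f"Expected value turns positive at a ~{p:.0%} chance of the minimum")
        break
```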

Other Examples of Big Bets

We’ve so far used the example of turning off a promising channel, but this framework applies to virtually any major strategic decision in marketing.

Here are a couple of examples of when it could be valuable to take a big swing:

  • Acquiring budget to test a new channel

  • Experimenting with changing the price

  • Investing in a recommendation algorithm

In the case of testing a new channel, the example we just walked through is reversed. On one side of the ledger we’d have the cost and projected ROI from testing the new channel, on the other side is the steady state (or potential decline) we’d see if we didn’t run the test. We can get industry benchmarks for expected performance from blog posts or by talking to media buyers in our network.

For pricing changes, the thought experiment is actually far simpler. We can design a pricing survey to estimate customer willingness to pay, which gives us estimates for how many customers we’d lose or gain in different pricing scenarios relative to the average value of a customer. We could grandfather in any existing customers for a period to ease the transition.
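As a rough illustration, the pricing thought experiment boils down to multiplying retained customers by price at each candidate price point. The numbers below are placeholders, not survey results:

```python
# Hypothetical price points and the share of customers a willingness-to-pay
# survey suggests we'd retain at each one (made-up figures for illustration).
current_customers = 10_000
scenarios = [
    {"price": 20, "retention": 1.00},  # current price, baseline
    {"price": 25, "retention": 0.90},  # survey suggests ~10% would churn
    {"price": 30, "retention": 0.75},  # survey suggests ~25% would churn
]

for s in scenarios:
    monthly_revenue = current_customers * s["retention"] * s["price"]
    print(f"${s['price']}/month -> retain {s['retention']:.0%}, revenue ${monthly_revenue:,.0f}/month")
```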

Most product tests of any significance require engineering work. It’s difficult to justify testing a new feature like product recommendations, because you have to incur the cost of building the feature in order to test it. However, we can estimate ahead of time how much it would improve conversion and retention rates, and work backwards from there to see if it’s worth the engineering effort.
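For the recommendations example, a back-of-the-envelope break-even calculation might look like the sketch below; the lift, traffic, and engineering cost figures are assumptions you’d replace with your own estimates:

```python
# Hypothetical: is building a product recommendations feature worth the engineering effort?
monthly_visitors = 200_000
baseline_conversion = 0.02   # current site conversion rate
estimated_lift = 0.10        # assumed 10% relative lift in conversion from recommendations
average_order_value = 60     # $ per order
engineering_cost = 150_000   # assumed cost to build and test the feature

extra_orders = monthly_visitors * baseline_conversion * estimated_lift
extra_revenue = extra_orders * average_order_value
print(f"~{extra_orders:.0f} extra orders/month, ${extra_revenue:,.0f}/month, "
      f"break-even in {engineering_cost / extra_revenue:.1f} months")
```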

Measuring the Value of Anything

Every big decision is unique in its own way, but the process is always the same: Calculate what happens in the worst possible scenario, what’s likely to happen in the best case, and then work out the expected value of the information gained from the experiment. 

The thinking behind the framework comes from the book “How to Measure Anything” by Douglas Hubbard, who shares a more complex template on his website.

The main takeaway from the calculation we did is not the specific calculations, but more the mindset that anything can be measured if it’s important enough.

We took what seemed like a big, scary, uncertain problem — figuring out the impact of a new channel — and broke it down into a series of manageable steps, increasing our confidence in making an important decision.

“Anything can be measured. If a thing can be observed in any way at all, it lends itself to some type of measurement method. No matter how “fuzzy” the measurement is, it’s still a measurement if it tells you more than you knew before. And those very things most likely to be seen as immeasurable are, virtually always, solved by relatively simple measurement methods.” – Douglas Hubbard

The crux of the book revolves around reframing measurement from “answer a specific question” to “reduce uncertainty based on what you know today.” That’s a much simpler problem to solve, because if we can estimate the cost of being wrong about the decision, and the chances of being wrong based on what we already know (in Bayesian terms, our ‘priors’), we can calculate the ROI from running an experiment to reduce the uncertainty we’re facing.

Hubbard’s framework goes like this:

  1. Define what you want to know. Consider ways you or others have measured similar problems in the past.

  2. Start your ROI calculation by determining how important your project is, and what the value of the outcome would be for the business.

  3. Estimate the uncertainty that you (and the wider team) feel about achieving that outcome.

  4. Figure out what level of certainty would make you comfortable enough to make a decision. Confidence is on a spectrum: it’s not a yes/no decision. 

  5. Determine how much it would be worth to spend on measurement to achieve that reduction in uncertainty, versus the potential cost of an experiment.

  6. Reframe the suggested experiment as a way to increase your confidence in the decision, not a guarantee of absolute certainty.
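Under the hood, steps 2 through 5 amount to an expected value of information calculation. Here’s a minimal sketch assuming a simple binary framing (either we’re wrong about the decision or we’re not); Hubbard’s own template is considerably richer.

```python
def value_of_measurement(cost_of_being_wrong: float,
                         chance_of_being_wrong: float,
                         cost_of_experiment: float) -> float:
    """Crude expected value of information: the expected loss the experiment
    could help us avoid, minus what the experiment itself costs to run."""
    expected_loss = cost_of_being_wrong * chance_of_being_wrong
    return expected_loss - cost_of_experiment

# Placeholder numbers: being wrong costs $500k, our prior says there's a 30%
# chance we're wrong, and the experiment costs $80k to run.
print(value_of_measurement(500_000, 0.30, 80_000))  # 70000.0 -> worth measuring
```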

Before a problem is modeled, it’s a black box: something you avoid thinking about because you’re afraid of what you might find when you do. The way to estimate the impact of big, scary experiments is to break the calculations down into smaller manageable variables. Once you’ve solved each smaller slice of the equation, it all adds up to a reasonable answer to something you thought unsolvable or unknowable before.

The trick is to allow yourself to be realistic with the best and worst case scenarios. You’ll often find that even the worst possible scenarios aren’t that bad, and the best cases are better than you thought. Yes, your ranges won’t be 100% accurate, but 100% accuracy isn’t the goal here: we’re just trying to estimate our current level of uncertainty, and the value of reducing it.

If you’re having trouble estimating uncertainty ranges for variables, talk to your wider team, mentors, fellow operators, and industry experts like consultants (benchmarking is one of the things they’re really good at). Part of the benefit of doing this exercise is to find where someone’s estimates deviate from the rest. This is almost always a sign that they know something others don’t, or haven’t been told something important.

If you do identify an outlier, have a discussion about why your beliefs differ: either you’ll be convinced, or they will, or your estimate will land somewhere in between. This estimate collection exercise can be a great way to tease that hidden knowledge out, and it ultimately leads to better estimates of your experiment’s potential ROI.

The main takeaway from this section is that evaluating the benefit of even your biggest, scariest tests isn’t impossible.

Bigger bets are more accessible than you realize once you model the best and worst case scenarios. When everyone’s expectations are captured in a spreadsheet, it’s easier to be confident in the face of a bold test. By getting into the habit of calculating the expected return from each experiment, you’ll naturally gravitate away from sweating the small stuff and toward the bigger bets you need to kickstart growth.

Decision Making In Uncertain Times

At the time of writing there’s a war in Ukraine, we’re on the brink of a global recession, we’re dealing with record inflation numbers, and all of this is following a global pandemic (which isn’t exactly over). In uncertain times like these you might be tempted to be conservative and stick with what you know, but that’s the exact opposite of what you should do.

To illustrate this point, let’s talk about ants.

When ants find a new food source, they lay pheromone trails so that the rest of the colony can find it. Each ant that follows the trail reinforces it by leaving their pheromones, making the signal stronger. Most ants follow the trail, because it would be wasteful to diverge.

However some ants do peel off the path, seemingly at random. They explore new areas, rather than exploit the existing food sources the colony already knows about.

Now here’s the kicker: the rate at which they explore versus exploit is directly proportional to the uncertainty in the surrounding environment. When food discovery gets unpredictable, ants collectively exhibit more exploratory behavior, so they aren’t stuck with a diminishing food source.

When there is more uncertainty, the value of information increases. Big risks you might not have taken in normal times become essential “hail mary” plays that need to be made.

Knowing how to make the explore-exploit tradeoff is key to setting an effective strategy.
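For the algorithmically inclined, the ants’ behavior maps loosely onto an epsilon-greedy strategy where the exploration rate scales with how uncertain the environment feels. The sketch below is purely illustrative; the scaling factor and channel ROIs are made up.

```python
import random

def choose_channel(known_roi: dict[str, float], uncertainty: float) -> str:
    """Pick a marketing channel epsilon-greedy style, exploring more often
    when `uncertainty` (a 0-1 judgment about environmental instability) is high."""
    explore_rate = min(1.0, 0.1 + 0.5 * uncertainty)  # arbitrary scaling for illustration
    if random.random() < explore_rate:
        return random.choice(list(known_roi))   # explore: try any channel
    return max(known_roi, key=known_roi.get)    # exploit: the best channel we know of

channels = {"flipflop": 1.2, "search": 2.1, "podcasts": 0.8}
print(choose_channel(channels, uncertainty=0.8))  # turbulent times: explore more often
```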

If your competitors are hunkering down, that’s an opportunity to gain an edge. What’s more, with shifting consumer sentiment, a lot of the results of previous experiments may be invalidated before your very eyes: the answer is more testing, not less.

You want to act counter-cyclically, in the spirit of Warren Buffett’s advice: “Be fearful when others are greedy, and greedy when others are fearful.”

Use this basic calculator to estimate the upside or downside of these types of bets.