Show Notes
- A couple of statistical methods are important because they prevent you from setting yourself up for failure in your marketing experiments.
- The two most important statistical methods, and the two discussed today, are: confidence (significance) and sample size
- Confidence is a number you calculate during and after your experiment
- The higher the confidence, the more likely you are to be able to reproduce the results; in other words, the more likely it is that the result is not a fluke
- For reference read How Statistical Significance Makes Your Results More Reliable
- Use Iterative Marketing’s Confidence Calculator
- A sample size is an estimate of the size of an audience needed to have a statistically significant (high confidence) result
- Sample size is actually calculated before we run our experiment
- To calculate sample size, you need your control’s conversion or success rate and the minimum detectable effect you want to be able to measure
- We can calculate sample size online using Optimizely’s sample size calculator
Charity of the Week: Wildlife Conservation Society
We hope you want to join us on our journey. Find us on IterativeMarketing.net, the hub for the methodology and community. Email us at podcast@iterativemarketing.net, follow us on Twitter at @iter8ive, or join The Iterative Marketing Community LinkedIn group.
The Iterative Marketing Podcast is a production of Brilliant Metrics, a consultancy helping brands and agencies rid the world of marketing waste.
Producer: Heather Ohlman
Transcription: Emily Bechtel
Music: SeaStock Audio
Onward and upward!
Transcription
Steve Robinson: Hello, Iterative Marketers. Welcome to the Iterative Marketing Podcast, where each week, we give marketers and entrepreneurs actionable ideas, techniques and examples to improve your marketing results. If you want notes and links to the resources discussed on the show, sign up to get them emailed to you each week at iterativemarketing.net. There, you’ll also find the Iterative Marketing blog and our community LinkedIn group, where you can share ideas and ask questions of your fellow Iterative Marketers. Now let’s dive into the show.
Hello everyone, and welcome to the Iterative Marketing podcast. I’m your host, Steve Robinson, and with me, as always, is the cool, calm and collected Elizabeth Earin. How are you doing today, Elizabeth?
Elizabeth Earin: I am good, Steve. How are you?
Steve Robinson: I am doing good. I am very well. I’ve got to get out of the habit of saying doing good, right? It’s doing well. Is that proper grammar?
Elizabeth Earin: I am not the right person to ask.
Steve Robinson: I am apparently not either. So, we’re not here to talk about grammar. What are we here to talk about?
Elizabeth Earin: Today, we’re going to talk about statistics, which — I will be honest, when we first discussed this as a topic for the show, I was a little nervous. I was the person who dreaded statistics in high school and then again in college, somehow made it through to this day, cannot tell you how, and when we talk about it in the business sense, it gets a little scary. I see the value in it, but it still is something that I am not completely comfortable with, so I was nervous.
Steve Robinson: I will be honest, my relationship with statistics is not a whole lot better than yours. As much as I studied computer science and love calculus and love math, I always thought statistics was like this like phony fake math thing because the formulas are all sort of made up kind of, I don’t know. So yeah, not a fan either, but –
Elizabeth Earin: Well, that’s why I love today’s episode. And obviously, we will get into it in a few minutes here, but it takes what I think is a complex topic that I think if you ask a lot of marketers out there it’s a kind of a scary thing and it breaks it down into a way that, one, reinforces the value and, two, makes it actionable. And so I am excited tonight. I can’t wait to get feedback from our listeners and find out if they feel the same way.
Steve Robinson: Absolutely, and we’re really only talking about two particular methods within statistics today. Certainly there are other methods of statistics that are valuable to marketers, particularly if you want to get really geeky and into the data side of things. But for everyday marketers, if you just pick up these two things, it will make a huge difference if you’re executing any sort of experimentation or A/B testing.
Elizabeth Earin: And they are important, because they prevent you from setting yourself up for failure in your experiments which, as we have talked about before, experiments are the backbone of Iterative Marketing.
Steve Robinson: Yeah. The two things that we’re talking about in particular are confidence and sample size. Confidence, or calculating your confidence, is sometimes referred to as statistical significance; that’s actually something slightly different that we’re not going to get into, but the two terms are kind of used interchangeably. And if you don’t use these two particular techniques, you run the risk of really sort of invalidating your experiments: either running experiments with no result or thinking you have got a result that isn’t really actionable.
Elizabeth Earin: So today, we’re going to go into what each of these are and what it means for you and your experiments, and then what happens if you ignore these things.
Steve Robinson: Kind of a big warning-warning sort of thing. So let’s start with confidence. What is confidence when we’re calculating confidence on an experiment?
Elizabeth Earin: So confidence is a number you calculate during and after your experiment, and it’s going to be the probability that if you carry this experiment out again, you would be able to reproduce the same results.
Steve Robinson: It’s essentially tied to the number of trials, or the number of opportunities that somebody could succeed or fail coming into the experiment. So if you’re testing a banner ad, the number of opportunities or trials is going to be the number of impressions that that banner ad receives, whereas the number of successes is the number of clicks that you got. If you’re testing an email, the number of trials could be the number of times that the email was sent, whereas the number of successes could be the number of times it was opened or the link inside was clicked. What counts as a trial really depends on what you’re testing, but the more trials you have, the greater your confidence in the result is. It’s also tied to the percentage difference between your winner and your loser in your A/B test. So if B wins by a huge margin over A, then you can in general be more confident in your result than if version B beat version A by a very slim margin.
Elizabeth Earin: Steve, let me ask you, what is a good goal? What should we be shooting for when we are talking statistical significance?
Steve Robinson: Your confidence number is going to be a percentage, right? So you could be 0% confident: “This is a complete fluke. We can never reproduce this.” Or you could be 100% confident: “I know that this will always, for sure, 100%, be the truth.” You’re never going to get to 100% confident. It’s mathematically impossible. You’d have to run the thing an infinite number of times to be 100% confident. So what you’re looking for is a number generally somewhere between 90% and 100%, and different industry professionals peg this number a little bit differently. I’d say the majority shoot for the middle at about 95%, but if you really want to iterate quickly and you’re okay taking on some risk that maybe this change isn’t exactly the right change, then you could go with a 90% confidence. If you want to be super conservative and be about as sure as you can be that every experiment you run is valid, some people run at 99% confidence. It depends on what your goals are as far as speed of iteration versus being sure that you’re making the right move.
Elizabeth Earin: So one thing to note: I think it’s extremely important, when you’re testing, to set a confidence goal and not declare a winner until you have reached that goal. That’s because if you don’t set a goal, you may be tempted to compromise on it to get a result out of the experiment. So to make sure that you do have that statistical significance, that you are confident this is not a fluke and is something you can repeat, you want to make sure that you’re working toward that established goal.
Steve Robinson: Because as your experiment progresses, you’re going to be calculating the confidence. So you’re watching the numbers. You say, “Okay, B is beating A. B is beating A. Let’s go and test and see what our confidence percentage is for whether B really does beat A.” And when you go and test that confidence, if it comes up just a hair short of 90%, you might be tempted to say, “Okay, we will run with that.” Or maybe it’s 88% or maybe it’s 87%. And if you don’t have a hard line that you have drawn in the sand, all of a sudden it becomes really easy to start thinking that anything over 50% seems like a good bet, right? And the reality is, no, you can really lead yourself astray.
Elizabeth Earin: So our listeners now know that this is important. It’s something they need to do, and they’re going to be looking for something within the 90% to 99% range, but how do they actually calculate it?
Steve Robinson: Yeah, if I were to give you the formula, you guys would all freak out and turn off the podcast right now. The formula for calculating confidence is actually pretty evil when it comes to math. The good news is that a bunch of other people have done this work for us, and you can go out and find a number of statistical confidence calculators out there. You can just Google it or we have one available ourselves, right, Elizabeth?
Elizabeth Earin: Yes we do, and you can find ours at bitly.com/conf-calc, and we will link to that in our show notes.
Steve Robinson: It’s important to note that, depending where you go to get your confidence calculator, if you compare them, you will likely get different results. And that’s because the math is rather complicated and there are a couple of different formulas out there to do the calculations. So, that’s okay. Pick a tool that you like and run with it, and just stay consistent with your tool and you’re fine. But if you, by some weird chance, decide to go and test a bunch of them, you will probably get different numbers.
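To give a rough idea of what a confidence calculator does under the covers, here is a minimal Python sketch using one common formula, the pooled two-proportion z-test. This is an assumption for illustration only: it is not necessarily the formula behind the Iterative Marketing calculator or any other tool, and the function name `confidence` is hypothetical.

```python
from math import erf, sqrt


def confidence(trials_a, successes_a, trials_b, successes_b):
    """Two-sided confidence (1 minus the p-value) that A and B truly differ."""
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled success rate under the assumption that A and B perform the same.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        return 0.0  # no successes (or no failures) at all: nothing to compare
    z = abs(rate_b - rate_a) / std_err
    # For a standard normal Z, P(|Z| <= z) equals erf(z / sqrt(2)).
    return erf(z / sqrt(2))


# 10,000 impressions each; 200 clicks for A and 240 clicks for B.
print(f"{confidence(10_000, 200, 10_000, 240):.1%}")  # just under 95%
```

Both levers described earlier show up here: more trials shrink `std_err`, and a bigger gap between the two rates grows `z`, and either one pushes the confidence up.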
Elizabeth Earin: So yesterday, I was actually on inbound.org and there was a very interesting discussion going on about the statistical significance. And one of the questions that popped up was what if I’ve been running my experiment for a while and I don’t have my confidence goal? And I think that’s a really great question, because I know we have run into that before when we run experiments and I am sure that some of our listeners either have, or if they start to run experiments in the future, this is something that they’re going to encounter.
Steve Robinson: Yeah, and I mean the key is if you don’t have enough trials in order to hit your confidence goal, increasing the number of trials will get you closer to hitting that confidence goal. So you just run the experiment longer, and that’s true to a point, right?
Elizabeth Earin: Yes. I think it’s very important to note that you don’t want to run your experiment too long. And we generally recommend running an experiment between two weeks and eight weeks. It may need to be longer for you. You’re going to know what’s right, but the reason that we have sort of settled on that eight-week timeframe for the length of it is that things shift over time. And there’s a probability the longer that you go, that other market forces, other things are going to be influencing this experiment that you’re running and so your data isn’t necessarily going to be apples to apples. Now maybe oranges come into play or a banana.
Steve Robinson: Yeah. And this sort of noise could include things like your public perception changing over time, or maybe there’s a seasonality component that’s going to shift buying patterns. And what will happen is B starts beating A or A starts beating B. And then there’s also a technical component to this, because generally, when you’re setting up an A/B test, you try to make sure that the people that are getting the A version only ever get the A version, and the people that are getting the B version only ever get the B version. But the technology behind the scenes that makes that happen in most of our tools relies on something called cookies. And if you’re familiar with cookies, you know that (a) you can clear your cookies at any given point in time and (b) when you change devices or when you get a new computer or when you reinstall your browser or whatever you might do, you end up actually wiping all of your cookies, whether you want to or not. And so the longer our experiments run, the more likely it is that our audience has done something to mess with their cookies or change devices, to the point where our A group isn’t only getting A. Now some of our A group is getting B, and our B group is not only getting B; some of them are getting A. And what that’s going to do is really decrease the margin by which B would beat A or vice versa, which actually hurts your confidence. So by running too long, you end up hurting your confidence more than you are helping it.
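The shrinking margin described here can be sketched with a little arithmetic. The Python below is a simplified model, not a description of how any testing tool works: it assumes the same share of each group (`crossover`, a hypothetical parameter) ends up seeing the other version, and that people respond to whichever version they actually see.

```python
def observed_rates(rate_a, rate_b, crossover):
    """Rates you would measure if a `crossover` share of each group saw the other version."""
    seen_a = (1 - crossover) * rate_a + crossover * rate_b
    seen_b = (1 - crossover) * rate_b + crossover * rate_a
    return seen_a, seen_b


# True rates: A converts at 2.0%, B at 2.4%. Suppose 10% of each group crosses over.
seen_a, seen_b = observed_rates(0.020, 0.024, crossover=0.10)
print(f"{seen_a:.2%} vs {seen_b:.2%}")  # 2.04% vs 2.36%
```

Under these assumptions the measured gap is the true gap multiplied by `1 - 2 * crossover`, so a 10% crossover turns a 0.4-point gap into a 0.32-point gap.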
Elizabeth Earin: So that’s why we don’t run longer than eight weeks. We also mentioned at the beginning, though, that we try not to run less than two weeks. And so what happens if you reach your confidence level in three days?
Steve Robinson: This is where you have another little anomaly that can occur. If you have a large volume of data, sometimes you can reach your confidence within three days, two days, one day, right? The issue here is you don’t know what might be special about that one, two or three days. The other thing is that, for most of us, day of the week actually makes a big difference in our results. And so we recommend that you run at least two weeks, because you want to loop through each day of the week twice before you can really declare an overall winner. If you don’t loop through the days of the week at least twice, then there’s a possibility that something was special about that particularly short time period that impacted the results overall. Just as you want your sample size, your number of trials, to be big enough that nothing was too special about the few people that happened to come in and be part of your experiment, you want to make sure that nothing is special about the time period either.
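Taken together, the confidence goal, the two-week minimum and the eight-week maximum amount to a simple stopping rule. Here is one way it might be written down in Python; the function name, the return strings and the default goal of 95% are assumptions made for illustration.

```python
def next_step(confidence, days_run, goal=0.95, min_days=14, max_days=56):
    """Decide what to do with a running experiment."""
    if days_run < min_days:
        # Too early: each day of the week has not yet been covered twice.
        return "keep running"
    if confidence >= goal:
        return "declare winner"
    if days_run >= max_days:
        # Past eight weeks, noise and cookie churn start to hurt more than help.
        return "stop, no winner"
    return "keep running"


print(next_step(confidence=0.99, days_run=3))             # keep running
print(next_step(confidence=0.89, days_run=20, goal=0.90))  # keep running
print(next_step(confidence=0.96, days_run=14))            # declare winner
```

The goal is a hard line: 89% against a 90% goal is still "keep running", no matter how close it looks.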
Elizabeth Earin: And we actually just had a conversation about this ourselves with one of our clients who is testing some pricing and wanted to end the experiment a little bit short. And we were dealing with — the period that we are looking at had a holiday in it, and so we didn’t necessarily — this wasn’t a normal period for us. And so to be able to just end it without extending that a little bit longer, we didn’t necessarily have the data that we needed to make the informed decision.
Steve Robinson: So I think that brings us to a great stopping point. Why don’t we take a quick break here and talk about how we can help some people.
Elizabeth Earin: Before we continue, I would like to take a quick moment to ask you Iterative Marketers a small but meaningful favor and ask that you give a few dollars to a charity that’s important to one of our own. This week’s charitable cause was sent in by Brian Tetapolis in Utah. Brian asks that you make a contribution to the Wildlife Conservation Society, an organization committed to protecting the world’s wildlife and wild places. Learn more at WCS.org or visit the link in the show notes. If you would like to submit your cause for consideration for our next podcast, please visit iterativemarketing.net/podcast and click the “Share a Cause” button. We love sharing causes that are important to you.
Steve Robinson: And we’re back. So before the break, we talked about statistical confidence and how it’s important to measure confidence as you are running your experiment and after it completes. But there’s another method that you can actually run before you start your experiment and that is a sample size calculation. Elizabeth, why don’t you explain just a brief bit what sample size is.
Elizabeth Earin: So a sample size is an estimate of the size of an audience that you need to have a statistically significant or high confidence result. And even though we’re talking about this after we addressed what confidence levels were, you’re actually going to do this first. You’re going to calculate this before you run your experiment.
Steve Robinson: And to calculate your sample size, you need two pieces of information. You need to know what has the success rate been in the past for your control in this experiment. So if you’re looking at a landing page, what has the conversion rate been? If you’re looking at a banner ad, what has the click-through rate been? If you’re looking at emails, what has your open rate been or click-through rate been? Whatever it is that you’re trying to improve, what has that success rate been in the past is going to be one of the two numbers. The other number that you need is a little bit less straightforward, and that is what is the smallest detectable change that you want to be able to measure. And so what that means is how exact do you need your measurements to be, because as your sample size gets bigger, your ability to measure small changes increases. And the reason for this is because if you have a small sample size, one outlier, one weirdo in your sample is going to throw off your entire measurement and it can throw it off by a decent margin if you have a small number of people. But if you have a big number of people, that one weirdo isn’t going to mess up your results by nearly as much. They can’t throw the average by as much. And so when you’re setting up your experiment, you need to kind of provide some sort of a guideline as to how small of a change matters to you in the grand scheme of things.
Elizabeth Earin: So let’s kind of give an example, maybe, to help put this into context. If your control has a 2% click-through rate and you want to detect a 10% or greater change, the test will only be able to detect a difference if your variation has a click-through rate of 2.2% or higher, or 1.8% or lower, so you’re looking at a 10% relative change in either direction.
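The arithmetic behind that example is short enough to spell out. In this Python sketch the function name `detectable_range` is hypothetical, and the minimum detectable effect is treated as a relative change (a percentage of the control rate, not percentage points).

```python
def detectable_range(baseline_rate, relative_mde):
    """Return the (lower, upper) rates at the edge of the minimum detectable effect."""
    change = baseline_rate * relative_mde
    return baseline_rate - change, baseline_rate + change


low, high = detectable_range(0.02, 0.10)
print(f"{low:.2%} to {high:.2%}")  # 1.80% to 2.20%
```

Anything the variation does between those two rates is too small a change for the experiment to pick up reliably.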
Steve Robinson: And so it really comes down to how big of an improvement or how big of a change you need to have before this experiment becomes useful to you. If a 10% lift in your click-through rate or conversion rate or whatever is a win, then 10% is a great number to put there. But if 20% is what you consider to be a win and something that’s worth actually making the changes or worth implementing, then 20% might be a better number for you to put there, because it will reduce your sample size. You don’t need as many people, as many trials in your experiment in order to make sure that your results are statistically valid.
Elizabeth Earin: So to calculate your sample size, and this is kind of what we have been talking about here, there are two pieces that are going to help determine it. We’re looking at the change that you want to be able to detect, a 10% lift or a 20% lift or a 50% lift, keeping in mind that 50% is a lot. And you’re going to combine that with your conversion or success rate, the thing that you’re trying to change, and those together are going to help determine what your sample size is. Is that correct, Steve?
Steve Robinson: That’s correct. So just as an example, and again, you’re going to be looking for an online calculator to do this, but if you plugged in a 2% conversion rate and needing a 10% lift, it would tell you that you needed 95,000 trials before you could have a reasonable confidence that your experiment was valid.
Elizabeth Earin: With the trials being whatever it is you’re measuring, the sessions or the clicks or impressions or whatever it is that you’re trying to make that change in.
Steve Robinson: Exactly, exactly.
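For anyone curious what a sample size calculator is doing with those two numbers, here is a minimal Python sketch of a standard textbook formula for comparing two proportions. It is not Optimizely's formula, and the 95% confidence and 80% power defaults are assumptions; because calculators use different formulas and settings, the result will not necessarily match the 95,000 figure quoted above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_mde, confidence=0.95, power=0.80):
    """Approximate trials needed in each variation of an A/B test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate after the smallest lift we care about
    # z-scores for the two-sided confidence goal and for the desired power.
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


ten_percent = sample_size_per_variation(0.02, 0.10)
twenty_percent = sample_size_per_variation(0.02, 0.20)
print(ten_percent, twenty_percent)
```

Because the minimum detectable effect is squared in the denominator, moving from a 10% lift to a 20% lift cuts the required sample to roughly a quarter.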
Elizabeth Earin: You had mentioned a calculator. I know Optimizely has a great one and you can either just Google Optimizely sample size calculator, otherwise we will link to it in the show notes.
Steve Robinson: Yeah, and there’s a number of other great ones out there. Google, find one that you like. We know and trust Optimizely but go where you need to go. I feel like we have answered all of this stuff.
Elizabeth Earin: No, we haven’t, I think. So I will pop in. Using the calculator seems pretty simple. You plug in the numbers, but there are a couple of things to keep in mind. Before you begin your test, run your numbers through the calculator to see if your expected traffic is going to be sufficient to run the experiment. And we have talked about how to know what those numbers are.
Steve Robinson: So regardless of which sample size calculator you use online, you’re going to want to use it before you run your trial. So you’re going to pop in your two numbers, your conversion rate which you can pull from your data in the past and then your minimum detectable effect, which, if you don’t know where to start, somewhere between 10% and 20% is a great shot. 20% is a fine number to run with if you don’t know where to start as far as that minimal detectable effect. It’s going to spit out a sample size that you need, and so that sample size is the number of trials or attempts or opportunities for somebody to have a success in your experiment. You’re going to take that number and look at it and say is that realistic? And oftentimes, that comes down to time, right, Elizabeth?
Elizabeth Earin: If you’re testing impressions and you know how many impressions your ad gets per day or how many sessions your landing page will get per day, then you can calculate how many days it’s going to take for that experiment to run. And based on what the calculator has told you, you’re going to be able to determine is this something that I can do? Is it something that I can see through and get the levels that I need to be confident in my decision?
Steve Robinson: And it really is just sort of a guideline. It’s unlike your confidence goal, which has to be exact: there you have to put a firm line in the sand and not take results that don’t meet your confidence goal. Here, if your number is in the ballpark of what you expect, it’s probably still worthwhile to run that experiment. Because again, if you have a high-margin result, the sample size you need can drop rather dramatically. And so if you think there’s a good chance that your experiment is going to produce a high-margin result, go ahead and run it even if you’re a little bit short of that sample size estimate within that eight weeks. If you’re way off though, no, don’t even bother. You’re just going to waste your time.
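The feasibility check described here can be sketched in a few lines of Python. The names are hypothetical, and the cutoff for "in the ballpark" (within 1.5 times the eight-week limit) is an assumption chosen purely for illustration; the episode does not give a number.

```python
from math import ceil

MIN_DAYS = 14   # two weeks, so each day of the week is covered twice
MAX_DAYS = 56   # eight weeks, the upper limit suggested in this episode
BALLPARK = 1.5  # assumption: "in the ballpark" means within 1.5x of the limit


def plan_experiment(required_trials, trials_per_day):
    """Return (days_needed, verdict) for a required sample size and daily traffic."""
    days_needed = ceil(required_trials / trials_per_day)
    if days_needed <= MAX_DAYS:
        verdict = "run it"
    elif days_needed <= MAX_DAYS * BALLPARK:
        verdict = "borderline: run only if you expect a high-margin result"
    else:
        verdict = "way off: don't run it"
    return days_needed, verdict


days, verdict = plan_experiment(required_trials=95_000, trials_per_day=2_000)
print(days, verdict)  # 48 run it

# Even if the sample fills quickly, plan to run for at least MIN_DAYS.
planned_length = max(days, MIN_DAYS)
```

With the same 95,000 trials but only 500 per day, the estimate is 190 days, which falls in the "don't even bother" category.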
Elizabeth Earin: So I had mentioned this Inbound discussion that I was part of yesterday. And one of the things that came up, kind of the main point of this article, is what do you do if you just don’t have enough impressions or you don’t have enough sessions for your experiment? Do you just move forward? Do you cancel it altogether? Do you just not run it? And I think that that’s something that a lot of marketers are dealing with. How would you approach that?
Steve Robinson: Yeah. I was actually asked this question the other day when I was out speaking in Denver, and unfortunately the person who asked the question came from a non-profit. In their case, they were trying to optimize their donation pages and they didn’t have any paid media behind them. And so my response didn’t help this person much, but I said if you are not getting the volume of people that you need to execute the experiment you want to run, the simplest solution is to put some paid media behind it. Chances are the insights that you’re going to get out of it, at least if you’re testing for insights and not just things like button color, are going to be worth that small investment in paid media to get the number of people to this site or in front of this ad, or whatever you need in order to make the experiment valid. (You can’t really do that with email.) If you can’t do that, then the message is, unfortunately, I am sorry, you really shouldn’t be trying to run experiments if you don’t have the volume of respondents to get a valid result. Otherwise, you really are wasting time and resources.
Elizabeth Earin: So I think that leads perfectly into our final topic here. And what happens if I ignore these tools? What happens if you don’t use these?
Steve Robinson: Let’s start with the sample size calculator, because that’s kind of just where we were. If you don’t use a sample size calculator before you run your experiment, it doesn’t mean you’re going to fail. Your experiment still could turn out very well, but you’re sort of flying blind because you don’t know going into it if there’s an opportunity for you to have a statistically-confident result. And so if you don’t use a sample size calculator before you begin your experiments, you run the risk of wasting resources. You waste your time in setting up the experiment and analyzing the results for something that’s not going to ever produce a result, and more importantly, there’s an opportunity cost because there’s a limited number of experiments that you can run in a given year, because you can’t run two experiments testing the same landing page or ad at the same time. There’s only so many you can fit in, and every time you attempt one that is destined for failure, you’re taking up space for one that could be successful. And so that’s really where running the sample size calculator comes into play. Elizabeth, what’s the downside of not testing for confidence?
Elizabeth Earin: If you don’t use a confidence calculator, you run the risk of making the wrong call based on data that looks like it’s telling a story, but it’s really just a fluke. And it’s happened to all of us. We have had that one huge sale come in, and we want to make decisions off of this one data point that isn’t necessarily in alignment with anything else that’s going on. When we’re not using the confidence calculator, we run a similar risk. It’s very dangerous, because we set ourselves up to introduce invalid insights into our knowledge base that, rather than moving us forward, take us back a step.
Steve Robinson: So the bottom line is use these tools. They’re really helpful. I know they’re math. I know they’re statistics, and it’s kind of hard to understand exactly what’s going on under the covers there, but they really do set you up for success with your testing. I guess that brings us to the end of our episode. So I want to thank everybody for making time again for us this week. And until next week, onward and upward!
Elizabeth Earin: If you haven’t already, be sure to subscribe to the podcast on YouTube or on your favorite podcast directory. If you want notes and links to resources discussed on the show, sign up to get them emailed to you each week at iterativemarketing.net. There, you’ll also find the Iterative Marketing blog and our community LinkedIn group, where you can share ideas and ask questions of your fellow Iterative Marketers. You can also follow us on Twitter, where our username is @iter8ive, or email us at podcast@iterativemarketing.net.
The Iterative Marketing Podcast is a production of Brilliant Metrics, a consultancy helping brands and agencies rid the world of marketing waste. Our producer is Heather Ohlman with transcription assistance from Emily Bechtel. Our music is by SeaStock Audio, Music Production and Sound Design. You can check them out at seastockaudio.com. We will see you next week. Until then, onward and upward!