Power Analysis

Reference: Probing into Minimum Sample Size by Mintao Wei

How do we determine the minimum sample size required to achieve a desired significance level and power?

The following table helps us understand how type I and type II errors come into play:

	Null Hypothesis: A is True	Alternate Hypothesis: B is True
Reject A	Type I Error (probability $α$)	Correct decision (power, probability $1 - β$)
Fail to reject A	Correct decision (probability $1 - α$)	Type II Error (probability $β$)

Type I Error refers to rejecting the null hypothesis when it is actually true, e.g. when we conclude that an AA test shows a significant difference. In short, it means we were too eager to deploy a poor variant. This should happen with probability $α$, the significance level that we set (typically 5%). We have a better handle on type I error because the baseline conversion rate is typically known prior to an experiment.

Type II Error refers to failing to reject the null hypothesis when the alternate is actually true, i.e. we failed to detect a significant effect for a variant that is genuinely better. In short, we were too conservative and failed to deploy a winning variant. In order to reason about type II error, we need to make an assumption about the distribution of test variant B. Typically, this is done by assuming a minimum effect $δ$ we wish to detect, setting $μ_{B} = μ_{A} + δ$, and re-using the standard deviation from A. With these assumptions in place, we choose a desired $\text{power} = 1 - β$, so that type II error occurs with probability at most $β$ (typically 20%). Note that since $δ$ is the minimum effect we wish to detect, if the actual effect turns out to be larger, the type II error can only be smaller than our desired amount, which is ok.

Now we can derive the formula for the minimum sample size required to achieve the desired levels of type I and type II error respectively.

Let us define the baseline conversion rate as $p$, and the minimum relative detectable effect as $d$. Consequently, the minimum detectable delta is $δ = d \times p$. Let the desired power level be $1 - β$, and the desired significance level be $α$. Assume we are running an AA or AB test with two variants of sample size $N$ each.

Firstly, we write down the distribution of the sample mean difference supposing we knew the true population means and standard deviations. Let $E(X_{A}) = μ_{A}, \mathrm{Var}(X_{A}) = σ_{A}^{2}$ and $E(X_{B}) = μ_{B}, \mathrm{Var}(X_{B}) = σ_{B}^{2}$. Note that $X_{A}, X_{B}$ may have arbitrary distributions, e.g. they could measure proportions, revenue etc.

Under the central limit theorem, the sample means will be distributed like so with $N_{A}, N_{B}$ samples: $\bar{X}_{A} \sim N(μ_{A}, \frac{σ_{A}^{2}}{N_{A}})$, $\bar{X}_{B} \sim N(μ_{B}, \frac{σ_{B}^{2}}{N_{B}})$. Importantly, the difference of the sample means will have the distribution below. Note that we add the variances together because $\mathrm{Var}(B - A) = \mathrm{Var}(B) + \mathrm{Var}(A)$ for any two independent random variables $A, B$.

$\bar{X}_{D} = \bar{X}_{B} - \bar{X}_{A} \sim N(μ_{B} - μ_{A}, \frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}) \qquad (1)$
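The distribution of the sample mean difference can be checked empirically. The sketch below is only an illustration: the function name `simulate_mean_differences` and the chosen rates and sample sizes are made up for this example, and Bernoulli outcomes are assumed purely for convenience. It repeats a two-group experiment many times and compares the empirical mean and variance of the difference against $μ_{B} - μ_{A}$ and $\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}$.

```python
import random
from statistics import fmean, pvariance


def simulate_mean_differences(p_a, p_b, n_a, n_b, reps, seed=0):
    """Simulate `reps` experiments and return the sample mean differences.

    Each observation is a Bernoulli draw (an assumption made for this
    illustration); group A has rate p_a and group B has rate p_b.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        mean_a = sum(rng.random() < p_a for _ in range(n_a)) / n_a
        mean_b = sum(rng.random() < p_b for _ in range(n_b)) / n_b
        diffs.append(mean_b - mean_a)
    return diffs


# Hypothetical example values, not taken from the text.
p_a, p_b, n_a, n_b = 0.2, 0.25, 200, 100
diffs = simulate_mean_differences(p_a, p_b, n_a, n_b, reps=2000)

# Theoretical mean and variance of the difference of sample means.
theoretical_mean = p_b - p_a
theoretical_var = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b

print("mean:", fmean(diffs), "vs", theoretical_mean)
print("variance:", pvariance(diffs), "vs", theoretical_var)
```

The empirical values should land close to the theoretical ones, with the gap shrinking as the number of repetitions grows.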

Now we can start working from the desired $α, β$ levels to the minimum sample size. We need to ensure that both objectives below are achieved with our sample size $N_{A}, N_{B}$ :

1. Assuming the null hypothesis to be true, ensure that type I error $\leq α$.
2. Assuming the alternate hypothesis to be true, ensure that type II error $\leq β$, i.e. power $\geq 1 - β$.

Let us define some notation first.

Let $z (ϕ)$ denote the critical value under the standard normal distribution such that $P (Z \leq z (ϕ)) = ϕ$ . This is basically the scipy.stats.norm.ppf function, e.g. $z (0.975) = 1.96$ .
We also want to denote the critical value under the distribution $\bar{X}_{D}$ of the sample mean difference under the null or alternate hypothesis (these are non-standard normal distributions). Let these be $z_{\bar{X}_{D} \mid H_{0}}(ϕ)$ and $z_{\bar{X}_{D} \mid H_{1}}(ϕ)$ respectively.
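As a small sketch of this notation, the snippet below uses `statistics.NormalDist.inv_cdf` from the Python standard library, which plays the same role as `scipy.stats.norm.ppf`. The helper names `z` and `z_general` are hypothetical and only mirror the symbols above.

```python
from statistics import NormalDist


def z(phi):
    """Standard normal quantile: the value c such that P(Z <= c) = phi."""
    return NormalDist().inv_cdf(phi)


def z_general(phi, mean, std):
    """Quantile of a non-standard normal N(mean, std^2)."""
    return NormalDist(mu=mean, sigma=std).inv_cdf(phi)


print(round(z(0.975), 2))  # 1.96

# For a zero-mean normal, the quantile is the standard quantile scaled by std.
print(z_general(0.975, mean=0.0, std=2.0), 2.0 * z(0.975))
```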


Illustration for Power Analysis Derivation

For objective 1, assuming the null hypothesis and using equation (1) above, we have $\bar{X}_{D} \mid H_{0} \sim N(0, \frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}})$. Since $α$ is a two-tailed probability and we want the critical region on the right side, let $α^{'} = 1 - α/2$. E.g. $α = 0.05$ implies $α^{'} = 0.975$. Then:

$z_{\bar{X}_{D} \mid H_{0}}(α^{'}) = z(α^{'}) \cdot \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} \qquad (2)$

Note that equation (2) above gives us the critical value: we will reject the null hypothesis if the sample mean difference $\bar{X}_{D}$ is greater than this value. To satisfy objective 2, we must thus ensure that the probability of rejecting the null hypothesis under the alternate hypothesis is at least $\text{power} = 1 - β$. In other words, we want $δ - z_{\bar{X}_{D} \mid H_{1}}(1 - β) \geq z_{\bar{X}_{D} \mid H_{0}}(α^{'})$, where $z_{\bar{X}_{D} \mid H_{1}}(1 - β)$ is measured as a distance from the mean $δ$ of the alternate distribution. Assuming the alternate hypothesis and again using equation (1), we have $\bar{X}_{D} \mid H_{1} \sim N(δ, \frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}})$. So then:

$\begin{aligned} δ - z_{\bar{X}_{D} \mid H_{1}}(1 - β) &\geq z_{\bar{X}_{D} \mid H_{0}}(α^{'}) \\ δ - z(1 - β) \cdot \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} &\geq z(α^{'}) \cdot \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} \end{aligned}$
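The condition above can be turned into a quick power check for given sample sizes. This is a minimal sketch under the same normal approximation; the function name `approx_power` and its parameters are hypothetical. Like the derivation, it only counts rejections in the right tail and ignores the tiny probability of rejecting in the left tail.

```python
import math
from statistics import NormalDist


def approx_power(delta, sigma_a, sigma_b, n_a, n_b, alpha=0.05):
    """Approximate P(reject H0 | H1) using only the right-hand critical region."""
    # Standard deviation of the sample mean difference.
    se = math.sqrt(sigma_a**2 / n_a + sigma_b**2 / n_b)
    # Critical value under H0, where the difference is N(0, se^2).
    critical_value = NormalDist().inv_cdf(1 - alpha / 2) * se
    # Probability of exceeding it under H1, where the difference is N(delta, se^2).
    return 1 - NormalDist(mu=delta, sigma=se).cdf(critical_value)


# Hypothetical example: unit standard deviations and delta = 0.1.
for n in (1000, 1570, 2000):
    print(n, round(approx_power(0.1, 1.0, 1.0, n, n), 3))
```

Power rises with the sample size, which is why the inequality can be solved for a minimum $N$.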

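For the purpose of getting a minimum $N$, we assume $N = N_{A} = N_{B}$. Substituting this, moving the $z(1 - β)$ term to the right-hand side so that $δ \geq (z(1 - β) + z(α^{'})) \cdot \sqrt{\frac{σ_{A}^{2} + σ_{B}^{2}}{N}}$, then squaring both sides and rearranging gives us: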

$N \geq \frac{(σ_{A}^{2} + σ_{B}^{2}) (z(1 - β) + z(α^{'}))^{2}}{δ^{2}} \qquad (3)$

This is the required minimum sample size equation. If we assume $σ_{A} = σ_{B}$, as is often done because we do not know the variance of the treatment, then it simplifies to the following form (as seen in Ron Kohavi's paper).

$N \geq \frac{2 σ _{A}^{2} \cdot ( z ( 1 - β ) + z ( α ^{'} ) ) ^{2}}{δ ^{2}}$
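The minimum sample size formula translates directly into code. The following is a sketch rather than a reference implementation: the function name `min_sample_size` and its defaults ($α = 0.05$, power $= 0.8$) are assumptions, and `statistics.NormalDist().inv_cdf` stands in for $z(\cdot)$.

```python
import math
from statistics import NormalDist


def min_sample_size(sigma_a, sigma_b, delta, alpha=0.05, power=0.8):
    """Minimum per-variant sample size for a two-sided test with equal groups.

    Implements N >= (sigma_a^2 + sigma_b^2) * (z(1 - beta) + z(alpha'))^2 / delta^2
    with alpha' = 1 - alpha / 2 and power = 1 - beta.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = (sigma_a**2 + sigma_b**2) * (z_alpha + z_power) ** 2 / delta**2
    # Round up because sample sizes are whole numbers.
    return math.ceil(n)


# Hypothetical example: equal unit standard deviations, delta = 0.1.
print(min_sample_size(sigma_a=1.0, sigma_b=1.0, delta=0.1))
```

With the default settings, $(z(1 - β) + z(α^{'}))^{2} \approx 7.85$, so the equal-variance form is close to the familiar rule of thumb $N \approx 16 σ^{2} / δ^{2}$.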

Bernoulli Events

Equation (3) above for the minimum sample size requires us to know the standard deviation under the null and alternate hypotheses. Usually, the standard deviation under the null is computed from historical data, and it is assumed that $σ_{A} = σ_{B}$. However, if the event we are interested in can be represented as a Bernoulli random variable (e.g. an impression is shown and the user either clicks or does not click with some probability), the equation may be simplified.

Specifically, the variance of a Bernoulli random variable with probability $p$ is $p \cdot (1 - p)$. Thus, if $X_{A} \sim \text{Bernoulli}(p_{A})$, then $\mathrm{Var}(X_{A}) = p_{A} \cdot (1 - p_{A})$, and likewise for $X_{B}$.

So we can use $σ_{A}^{2} = p_{A} \cdot (1 - p_{A})$ and $σ_{B}^{2} = (p_{A} + δ) \cdot (1 - p_{A} - δ)$, where $δ = d \times p_{A}$, and substitute these into equation (3). We will then be able to have a minimum sample size formula by just specifying $α$, $β$, baseline conversion $p_{A}$ and minimum relative difference $d$. This is the formula used by Evan Miller's sample size calculator.
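A sketch of the Bernoulli case follows, plugging the two variances into the general sample size formula. The function name `min_sample_size_bernoulli` and the example inputs are hypothetical, and this is not claimed to reproduce any particular calculator's output.

```python
import math
from statistics import NormalDist


def min_sample_size_bernoulli(p_a, d, alpha=0.05, power=0.8):
    """Minimum per-variant sample size for a conversion-rate test.

    p_a is the baseline conversion rate and d the minimum relative
    detectable effect, so the absolute effect is delta = d * p_a.
    """
    delta = d * p_a
    p_b = p_a + delta
    if not 0 < p_a < 1 or not 0 < p_b < 1:
        raise ValueError("p_a and p_a * (1 + d) must both lie in (0, 1)")
    # Bernoulli variances under the null and alternate hypotheses.
    var_a = p_a * (1 - p_a)
    var_b = p_b * (1 - p_b)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil((var_a + var_b) * (z_alpha + z_power) ** 2 / delta**2)


# Hypothetical example: 20% baseline, 25% relative lift (5 points absolute).
print(min_sample_size_bernoulli(p_a=0.2, d=0.25))
```

Halving the relative effect $d$ roughly quadruples the required sample size, since $δ$ enters the formula squared.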

Imbalanced AB Test

Another common scenario is the case where we do not split 50-50, i.e. $N_{A} \neq N_{B}$. In this case, suppose we have $N_{A} = p \times N_{B} = p \times n$, where $p > 1$ (here $p$ denotes the allocation ratio, not the baseline conversion rate). For example, if we have a 90-10 split, then $p = 9$. Then we get:

$\begin{aligned} δ &\geq (z(α^{'}) + z(1 - β)) \cdot \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} \\ δ &\geq (z(α^{'}) + z(1 - β)) \cdot \sqrt{\frac{σ_{A}^{2} + p \cdot σ_{B}^{2}}{p n}} \\ n &\geq \frac{(z(α^{'}) + z(1 - β))^{2}}{δ^{2}} \left(\frac{σ_{A}^{2}}{p} + σ_{B}^{2}\right) \end{aligned}$

Note that $n$ gives the sample size required for $N_{B}$ , and the total sample size required across both groups is $n (p + 1)$ .
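The imbalanced case can be sketched in the same way. The function name `min_sample_size_imbalanced` and the parameter name `ratio` (standing in for the allocation ratio $p$) are hypothetical choices for this illustration.

```python
import math
from statistics import NormalDist


def min_sample_size_imbalanced(sigma_a, sigma_b, delta, ratio, alpha=0.05, power=0.8):
    """Return (n_a, n_b) for an uneven split where N_A = ratio * N_B.

    Implements n >= (z(alpha') + z(1 - beta))^2 / delta^2 * (sigma_a^2 / ratio + sigma_b^2),
    where n is the size of the smaller group B.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n_b = math.ceil(
        (z_alpha + z_power) ** 2 / delta**2 * (sigma_a**2 / ratio + sigma_b**2)
    )
    n_a = math.ceil(ratio * n_b)
    return n_a, n_b


# Hypothetical example: 90-10 split versus a 50-50 split.
n_a, n_b = min_sample_size_imbalanced(1.0, 1.0, 0.1, ratio=9)
print("90-10 split:", n_a, n_b, "total", n_a + n_b)
print("50-50 split:", min_sample_size_imbalanced(1.0, 1.0, 0.1, ratio=1))
```

With `ratio=1` the result reduces to the balanced formula, and the total for an uneven split is larger than for a 50-50 split at the same power.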
