1. A trivial example: flipping a coin

Question:

A coin is tossed 100 times; heads appears 65 times and tails appears 35 times. Construct a hypothesis test to check whether the coin is fair ($P_{head}=P_{tail}=\frac{1}{2}$).

Hypothesis: $$H_0:P_{h}=0.5, H_a:P_{h}\neq 0.5$$

(1) Z test

For a binomial distribution each trial is either 0 or 1, so the sample mean is also the proportion of successes. For this experiment we already know that the number of heads follows a binomial distribution, so let $X$ be the count of successes; then: $$\mu_X = np$$ $$\sigma^2_X=np(1-p)$$ and, approximately, $$X \sim N(np, np(1-p))$$

Under the null hypothesis the SD (Standard Deviation) $\sigma$ is known, and we need to compute the SE (Standard Error) of the mean. The sample mean is actually the sample proportion of successes (count of successes per Bernoulli trial), $\hat{p}=\frac{X}{n}$, so

$$\mu_\hat{p}=np/n=p$$$$\sigma^2_\hat{p} = var(\hat{p}) = var(\frac{X}{n}) = \frac{var(X)}{n^2} = \frac{p(1-p)}{n}$$$$\hat{p} \sim N(p, \frac{p(1-p)}{n})$$

Therefore, the SE of the mean (sample proportion) is $\sigma_\hat{p}=\sqrt{p(1-p)/n}$.

As $\sigma$ is known, we can use a Z test.

Z statistic:

In the following formulas we simply use $\bar{X}$ to represent $\hat{p}$, the sample proportion of successes (the binomial proportion of successes is equivalent to the Bernoulli sample mean). Note that here $\bar{X} \neq \frac{\sum{X_i}}{n}$, as we already defined $X$ to be the count of successes of a binomial distribution; we don't have individual $X_i$ here.

$$z = \frac{\bar{X}-\mu}{SE}=\frac{\bar{X}-\mu}{\sigma_\hat{p}}=\frac{\bar{X}-p}{\sqrt{p(1-p)/n}}=\frac{0.65-0.5}{\sqrt{0.5*0.5/100}}=3$$
In [1]:
# compute p-value
X_bar <- 0.65
mean <- 0.5
len <- 100
p <- 0.5
z <- (X_bar - mean) / sqrt(p*(1-p) /len)  # z statistic for the sample proportion
p_value <- 1 - pnorm(z)                   # upper-tail probability
p_value*2                                 # two-tailed p-value
0.00269979606326021

For a two-tailed Z test, the P-value is smaller than 0.05, so we reject the null hypothesis.


(2) Binomial test

From the experiment, we know $X\sim Binomial(100, 0.5)$. The P-value is the probability of seeing a result as extreme as, or more extreme than, the observed one, so we can compute the probability that $X$ is greater than or equal to the observation (and double it for the two-sided test). $$P(X\geq 65)=\sum_{k=65}^{100} \binom{100}{k} p^k(1-p)^{100-k}$$

There are multiple ways to implement a binomial test in R. We try three of them below.

In [2]:
# method 1: use binom.test
binom.test(x=65, n=100, p=0.5, alternative = c('two.sided'), conf.level = 0.95)

# method 2: compute the probability >= observation by hand
P <- 0
for (k in seq(65, 100)){
  P <- P + dbinom(k, len, prob=p)
}
P*2

# method 3: compute cdf of binomial directly
# by symmetry of Binomial(100, 0.5), P(X <= 35) = P(X >= 65)
2*pbinom(100-65, size = len, prob = p) # same as (1-pbinom(64, size = len, prob = p))*2
	Exact binomial test

data:  65 and 100
number of successes = 65, number of trials = 100, p-value = 0.003518
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5481506 0.7427062
sample estimates:
probability of success 
                  0.65 
0.00351764172297016
0.00351764172297016

The P-values computed by the three methods are consistent. We reject the null hypothesis.

  • Normal Approximation to Binomial:

According to the Central Limit Theorem, since each toss is a Bernoulli trial, the sum/mean of the sample has a limiting normal distribution. That is, we can use a normal distribution to approximate the binomial distribution (a sum of Bernoulli variables) when n is large. As shown below, the p-value computed from the normal distribution is quite close to the one from the binomial distribution. (Note we apply a "continuity correction" to make the approximation more precise, see here.)

In [3]:
# normal approximation
X_bar <- 65-0.5 # continuity correction 
mean <- 50
len <- 100
p <- 0.5
z <- (X_bar - mean) / sqrt(p*(1-p) *len)
p_value <- 1 - pnorm(z)
p_value*2
0.00373162660076809

(3) Distribution of number of successes vs. distribution of proportion of successes

We mentioned these concepts in the first part, but they are easy to confuse, so some more discussion is needed to make them clear.

  • Prerequisite knowledge: Central Limit Theorem

The Central Limit Theorem (CLT) implies that for a sample of independent random variables, the sum tends toward a normal distribution even if the original variables themselves aren't normally distributed; the sample mean likewise tends toward a normal distribution (sum and mean differ only by the factor $1/n$).

For the sample in this problem, each variable is Bernoulli distributed, so the sum is binomially distributed.

  • Distribution of number of successes:

If we consider the sum of the random variables in a sample, then according to the CLT, as n increases the sum approximately follows a normal distribution, $$X \sim N(np, np(1-p))$$

  • Distribution of proportion of successes:

If we consider the mean of the random variables in a sample, then according to the CLT, as n increases the mean approximately follows a normal distribution, $$\bar{X} \sim N(\frac{np}{n}, \frac{np(1-p)}{n^2}) = N(p, \frac{p(1-p)}{n})$$

  • Another Hypothesis test:

If the coin is fair, the expected number of heads in 100 tosses is 50. Given the null hypothesis (the coin is fair, i.e. the expected number of successes is 50), we want to test whether our observation (65 heads in 100 tosses) is significantly different from the expected number of heads ($\mu=100\times 0.5=50$). Using the count of successes $X$ to construct the test statistic, with $H_0:E[X]=50, H_a:E[X]\neq 50$: $$\frac{X-\mu}{SD}=\frac{X-np}{\sqrt{np(1-p)}} = \frac{65-50}{\sqrt{100*0.5*(1-0.5)}} = 3$$ This is equivalent to the test statistic constructed from the proportion of successes, as the quick check below confirms.
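
The equivalence is easy to verify numerically; a minimal sketch in R, using only the numbers already given above:

# z statistic from the count of successes
z_count <- (65 - 100*0.5) / sqrt(100*0.5*(1-0.5))
# z statistic from the proportion of successes
z_prop  <- (0.65 - 0.5) / sqrt(0.5*(1-0.5)/100)
c(z_count, z_prop)  # both equal 3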


2. A/B testing: Ads Click Through Rate

Question:

Two Ads: Ad one has 1000 impressions and 20 clicks, so its CTR is 2%; Ad two has 900 impressions and 30 clicks, so its CTR is 3.3%. Test whether there is a difference in Click Through Rate (CTR) between Ad one and Ad two.

t-test for two populations:

Similar to the first question, the sample mean of each experiment is the CTR we need to compare. By the CLT, each sample mean is approximately normally distributed. Then we have $$\bar{X}_1 \sim N(p_1, p_1(1-p_1)/n_1)$$ $$\bar{X}_2 \sim N(p_2, p_2(1-p_2)/n_2)$$ $$\bar{X}_1 - \bar{X}_2 \sim N(p_1-p_2, \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2})$$

Hypothesis: $$H_0:p_1 - p_2 = 0, H_a:p_1 - p_2 \neq 0$$

t test statistic: $$ t = \frac{\bar{X}_1 - \bar{X}_2 - 0}{SE} = \frac{\bar{X}_1 - \bar{X}_2 - 0} {\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}}$$

Since the sample standard deviation $S_1=\sqrt{p_1(1-p_1)}$ differs from $S_2=\sqrt{p_2(1-p_2)}$, we use the unpooled standard error t-test (the unpooled version is usually recommended when one sample SD is more than twice the other; here they are not that different, but the unpooled test is still a safe choice). For the unpooled t-test we choose $\min(n_1-1, n_2-1)$ degrees of freedom. For more details on the pooled and unpooled standard error methods, check the notes of PSU here.

In [4]:
n1 <- 1000; n2 <- 900; click1 <- 20; click2 <- 30
x1 <- click1/n1; x2 <- click2/n2               # sample CTRs
p1 <- x1; p2 <- x2                             # plug-in estimates for the SE
SE <- sqrt( p1*(1-p1)/n1 + p2*(1-p2)/n2 )      # unpooled standard error
t <- abs((x1 - x2)) / SE
p_value <- 1 - pt(t, df = min(n1 - 1, n2 - 1))
p_value * 2                                    # two-tailed p-value
0.0735775780571641

For a two-tailed t-test, the P-value is greater than 0.05, so we fail to reject the null hypothesis. However, if we have prior knowledge that Ad two can be no worse than Ad one, we can modify the hypothesis and check whether the CTR of Ad two is greater than that of Ad one.

Hypothesis for one-tailed test: $$H_0:p_1 - p_2 = 0, H_a:p_2 - p_1 > 0$$

This is a one-tailed t-test. The P-value is half of the previous one and less than 0.05, so we reject the null hypothesis and say we're 95% confident that Ad 2 has a higher CTR than Ad 1. However, note that switching to a one-tailed test after a two-tailed test has failed to reject the null hypothesis is wrong. For when a one-tailed test is appropriate, see here.
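
For reference, a minimal sketch of the one-tailed computation, reusing `t`, `n1` and `n2` from the cell above:

p_value_one <- 1 - pt(t, df = min(n1 - 1, n2 - 1))  # upper tail only, no doubling
p_value_one                                         # roughly 0.037, half of the two-tailed value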

Why use a t-test rather than a Z-test?

We use a t-test instead of a Z-test because we don't know the standard deviation of the population, so we substitute the sample standard deviation for the unknown population standard deviation, and a t-test should then be used. Notice this is different from the first example, where the observed mean is compared with the expected mean: there is only one population, and if we assume the null hypothesis is true the population standard deviation can be computed.

(Note: the binomial case is different from the normal case. For a Bernoulli/binomial variable we can compute the SD once we know the mean, but for a normal variable the SD is still unknown even if the mean is known. So for a normal sample we still need a t-test, e.g. given a sample of male weights, to test whether the mean equals 70kg we use a t-test since $\sigma$ is unknown.)
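
To make the weight example concrete, here is a minimal sketch with simulated data (the sample size, true mean and true SD are made up for illustration):

set.seed(1)
weights <- rnorm(30, mean = 72, sd = 10)  # hypothetical sample of 30 male weights (kg)
t.test(weights, mu = 70)                  # sigma is unknown and estimated from the sample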


3. A/B testing continued: Chi-square ($\chi^2$) test

The Chi-square test can be used for two purposes:

  • Goodness of fit test. When you have a categorical variable from a single population, test whether the sample data are consistent with a hypothesized distribution (a small sketch using the coin example follows this list). For explanation and example, see here
  • Test for independence. When you have two categorical variables from a single population, test whether there is a significant association between the two variables. For explanation and example, see here and here
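
For the goodness-of-fit purpose, the coin example from section 1 can be tested directly with `chisq.test`; a minimal sketch, not run as a cell above:

# goodness of fit: does the 65/35 split fit a fair coin?
chisq.test(c(65, 35), p = c(0.5, 0.5))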

Let's look at the Ads click example again. It can be regarded as two categorical variables: Ad type (Ad 1 vs. Ad 2) and Whether click (Clicks vs. NonClicks). The original table and the table with expected values are shown below (expected counts are in parentheses and each cell's contribution to $\chi^2$ is in brackets).

In [5]:
library(knitr)
Clicks <- c(20, 30)
Impressions <- c(1000, 900)
NonClicks <- Impressions - Clicks
ad <- data.frame(Clicks, NonClicks, Impressions)
ad <- rbind(ad, 'Column total' = c(50, 1850, 1900))
row.names(ad) <- c('Ad 1', 'Ad 2', 'Column Total')
kable(ad, align = 'l')

# compute the mean and chi-square score with the online tool:
# http://www.socscistatistics.com/tests/chisquare/Default2.aspx
ad2 <- ad
ad2[1, 1] <- '20   (26.32)   [1.52]'
ad2[1, 2] <- '980   (973.68)   [0.04]'
ad2[2, 1] <- '30   (23.68)   [1.68]'
ad2[2, 2] <- '870   (876.32)   [0.05]'
kable(ad2, align = 'l')

|             |Clicks |NonClicks |Impressions |
|:------------|:------|:---------|:-----------|
|Ad 1         |20     |980       |1000        |
|Ad 2         |30     |870       |900         |
|Column Total |50     |1850      |1900        |

|             |Clicks                |NonClicks               |Impressions |
|:------------|:---------------------|:-----------------------|:-----------|
|Ad 1         |20   (26.32)   [1.52] |980   (973.68)   [0.04] |1000        |
|Ad 2         |30   (23.68)   [1.68] |870   (876.32)   [0.05] |900         |
|Column Total |50                    |1850                    |1900        |

Hypothesis:

$H_0:$ Variable Ad type and variable Whether click are independent

$H_a:$ Variable Ad type and variable Whether click are not independent

$\chi^2$ statistic:

$$\chi^2 = \sum \frac{(O_i - E_i)^2} {E_i} $$$$= \frac{(20-26.32)^2}{26.32} + \frac{(980-973.68)^2}{973.68} + \frac{(30-23.68)^2}{23.68} + \frac{(870-876.32)^2}{876.32} = 3.2865$$
In [6]:
# method 1: compute chi_score and p-value 
chi_score <- 3.2865
df <- (2-1)*(2-1)
1 - pchisq(3.2865, df=df)

# method 2: use chisq.test directly
chisq.test(ad[1:2, 1:2], correct = FALSE)
0.0698517737261586
	Pearson's Chi-squared test

data:  ad[1:2, 1:2]
X-squared = 3.2865, df = 1, p-value = 0.06985

The P-value is greater than 0.05, so we fail to reject the null hypothesis. That is, we cannot conclude that the variables Whether click and Ad type are related, which means we can't say that different Ad types have different proportions of clicks (different CTRs). The Chi-square test gives a result consistent with the t-test.

Comparison:

With a t-test we can check whether the difference between two samples is different from zero. However, if we have more than two samples from different populations, the Chi-square test can still be applied. For example, suppose there are three Ads: 1, 2 and 3. For the Chi-square test there are still two categorical variables, Ad type and Whether click, so we can still check whether different Ad types yield different proportions of clicks; see the sketch below.
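
A minimal sketch of the three-Ad case (the click and impression numbers for Ad 3 are made up for illustration):

# hypothetical data: a third Ad with 950 impressions and 25 clicks
clicks      <- c(20, 30, 25)
impressions <- c(1000, 900, 950)
tab <- cbind(Clicks = clicks, NonClicks = impressions - clicks)
rownames(tab) <- c('Ad 1', 'Ad 2', 'Ad 3')
chisq.test(tab)   # df = (3-1)*(2-1) = 2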

(Question: how should we interpret the result if there is indeed a difference? For example, can we conclude which Ad has a higher CTR?)

(To Be Continued)

