This post is about trying to solve the following brain-teaser:

You toss a fair coin 400 times. What’s the probability that you get at least 220 heads? Round your answer to the nearest percent.

Each coin toss is an independent Bernoulli random variable parametrised by $p=\frac{1}{2}$ since the coin is fair. We’ll call a single sequence of $400$ tosses a trial. The number of heads scored in a trial is a Binomial random variable with the probability mass function (PMF)

$$p_x(k) = \binom{400}{k} \left(\frac{1}{2}\right)^{k} \left(\frac{1}{2}\right)^{400-k} = \binom{400}{k} \frac{1}{2^{400}}.$$
This means, for example, that the probability of scoring exactly $220$ heads is

$$p_x(220) = \binom{400}{220} \frac{1}{2^{400}} = \frac{\binom{400}{220}}{2^{400}}.$$
The final form of this expression can be interpreted as the number of ways to choose $220$ heads out of $400$ tosses in the numerator, over the total number of possible configurations in the denominator.

Herein comes the trouble. We would like to know the probability of the number of heads being $\ge 220$, not exactly $220$. In other words, we seek to compute the sum

$$\sum^{400}_{k=220} p_x(k).$$
There are a handful of approaches one can take to solve this problem.

1. Run $1\,000\,000$ trials, each of $400$ tosses. Count how many trials scored $\ge 220$ heads. This is the most direct approach, but how does one justify the chosen number of trials? It is also the most computationally expensive approach but, as we will see, it has its value: it is very easy to implement, and it provides a good estimate for comparison with the other methods.

2. The sum $\sum^{400}_{k=220} p_x(k)$ can be computed directly by iteration.

3. A classic pencil and paper approach could use a normal approximation based on the central limit theorem. Statistical tables will be required to avoid any computer or calculator use. [One could avoid using tables and compute everything manually, for example by using Taylor expansions of $e^x$ but we won’t go there.]

# Option 1: Simulation

The simulation does not require much understanding of the probabilistic model behind the problem. We just run $1\,000\,000$ trials, $400$ tosses in each, and find out how many of the trials scored at least $220$ heads.

Running the first simulation in R, I get the value of $0.03$, so the answer is $3\%$.

For no good reason at all, the same simulation crudely implemented in C++11:

It runs slower than the R version above, and a representative answer is $\approx 0.025317$, that is about $3\%$.

# Option 2: Sum by Iteration

For a more formal approach, consider the sum

$$\sum^{400}_{k=220} p_x(k) = \sum^{400}_{k=220} \binom{400}{k} \left(\frac{1}{2}\right)^{k} \left(\frac{1}{2}\right)^{400-k}.$$
Since the Bernoulli probability $p=0.5$ of each toss is fixed, the sum of probabilities reduces to a sum of binomial coefficients

$$\frac{1}{2^{400}} \sum^{400}_{k=220} \binom{400}{k}.$$
This sum is easy to do on a computer.

This code returns the value of $0.03$, or $3\%$. The result was computed much, much faster than by simulation, but the important thing here is that it is the same answer, obtained using a different method.

The C++ implementation would have to include a way to compute $\binom{n}{k}$, since the standard library does not provide one. Implementing it is a good exercise, since one cannot rely on the mathematical definition involving factorials – the result would overflow even for rather small values of $n$. (Knuth, 2011) has a lot more to say on this.

# Option 3: Normal Approximation Based on the Central Limit Theorem

Since each coin toss $X_i$ is an independent Bernoulli random variable and all tosses in a trial are identically distributed with mean $\mu = \frac{1}{2}$ and variance $\sigma^2 = \frac{1}{4}$, we can try to pretend that the sum $S_n = \sum^{400}_{i=1} X_i$ of such variables is normally distributed.

Then we can compute the mean of this sum

$$E\left[S_n\right] = n\mu = 400 \cdot \frac{1}{2} = 200$$
and also its variance

$$\mathrm{var}\left(S_n\right) = n\sigma^2 = 400 \cdot \frac{1}{4} = 100, \qquad \sigma_{S_n} = \sqrt{100} = 10.$$
Now we could use the approximation

$$P\left(220 \le S_n \le 400\right) = P\left(S_n \le 400\right) - P\left(S_n \le 219\right) \approx \Phi\left(z_{400}\right) - \Phi\left(z_{219}\right),$$
where $\Phi\left(z\right)$ is available from standard normal CDF tables.

If you think of it in terms of the area under the density curve, we would like to know the area of the slice between $220$ and $400$ – in other words, the difference between the values of the CDF at these two locations.

Two points must be noted. Firstly, $P\left( S_n \le 400 \right) = 1$, since there are only $n=400$ tosses in a trial. Secondly, we have to be careful when moving from the discrete to the continuous world – we would like to include $P\left( S_n = 220 \right)$ in our answer, so the last (discrete) value to exclude is $S_n = 219$.

The CDF tables are normalised, so let’s normalise the boundary value into $z$

$$z_{219} = \frac{219 - E\left[S_n\right]}{\sigma_{S_n}} = \frac{219 - 200}{10} = 1.9.$$

The other boundary gives $z_{400} = \frac{400 - 200}{10} = 20$, for which $\Phi\left(z_{400}\right)$ is indistinguishable from $1$.
My tables (Lindley D.V. & Scott W.F., 2008) state that

$$\Phi\left(1.9\right) = 0.9713,$$

which gives $P\left(S_n \ge 220\right) \approx 1 - 0.9713 = 0.0287$.
Rounding this value to the nearest percent, the final answer is $3\%$, as before.

The normal approximation works so well because, to quote (Bertsekas & Tsitsiklis, 2008, p. 279),

> When $p$ is close to $\frac{1}{2}$, in which case the PMF of $X_i$ is symmetric, the […] formula yields a very good approximation for $n$ as low as $40$ or $50$.

This was the pencil-and-paper (and statistical tables) approach. We can do the same thing in software to verify the answer.

This code returns the value of $0.03$, or $3\%$, same as before.

There we have it!

# Conclusions and Meditations

Despite my education in mathematical modelling, I still get excited when a crude simulation turns out to be helpful for guiding more theoretical work. Such was the case this time as well – the simple simulation of the first option prevented me from making an order-of-magnitude error in my theoretical derivations later.