This section is aimed at students in upper secondary education in the Danish school system; some objects will be simplified and some details will be omitted.

Confidence Intervals

Assume that there is a population of objects of which an unknown proportion, $p\in[0,1]$, has a specific attribute, and that you select a number, $n\in\mathbb{N}$, of these objects at random. Then the number of objects in the sample with this attribute is a binomially distributed random variable, $X\sim B(n,p)$. To estimate the unknown proportion, we observe the number of objects in the sample with the attribute, $r\in[0,n]\cap\mathbb{N}$, and then we look for the lowest and highest proportions, $p_-$ and $p_+$, that would reasonably yield this result. Here "reasonably" means that $r$ lies in the acceptance region of the associated binomial distribution at significance level $\alpha$. The $1-\alpha$ confidence interval is then $[p_-,p_+]$.
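The definition above can be tried out numerically. The following is a minimal sketch, not a definitive implementation. It assumes that the significance level is split equally between the two tails, and that $p_-$ and $p_+$ are the proportions at which the tail probabilities $P(X\geq r)$ and $P(X\leq r)$ are exactly $\alpha/2$. This convention, and the helper names `binom_cdf`, `solve_increasing` and `exact_interval`, are assumptions made for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p); returns 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(g, target, tol=1e-10):
    """Find p in [0, 1] with g(p) = target by bisection, for increasing g."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(r, n, alpha=0.05):
    """Bounds [p_-, p_+] under the equal-tails assumption stated above."""
    # p_-: the upper tail P(X >= r) grows with p and reaches alpha/2 at p_-.
    p_minus = solve_increasing(lambda p: 1 - binom_cdf(r - 1, n, p), alpha / 2)
    # p_+: the lower tail P(X <= r) shrinks with p and reaches alpha/2 at p_+.
    p_plus = solve_increasing(lambda p: 1 - binom_cdf(r, n, p), 1 - alpha / 2)
    return p_minus, p_plus


p_minus, p_plus = exact_interval(30, 100)
print(f"[{p_minus:.3f}, {p_plus:.3f}]")
```

For $r=30$ and $n=100$ the interval contains the observed proportion $0.3$, as expected.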

The Wilson 95% Confidence Interval Formula

For a sufficiently large sample we can estimate the 95% confidence interval by Wilson's formula $$p_\pm\approx\frac{r+2}{n+4}\pm\frac{2}{n+4}\sqrt{\frac{r(n-r)}{n}+1}$$

This yields an interval symmetric about $\frac{r+2}{n+4}$, which is not intuitively the best estimate of $p$. However, this interval requires fewer assumptions than some others, and for large $n$ and $r$ its centre is approximately equal to the intuitive estimate $\frac{r}{n}$.
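As a small illustration, Wilson's formula can be evaluated directly. The function name `wilson_interval` and the example values $r=30$, $n=100$ are chosen here for illustration only.

```python
from math import sqrt


def wilson_interval(r, n):
    """Approximate 95% confidence interval from Wilson's formula above."""
    centre = (r + 2) / (n + 4)
    half_width = 2 / (n + 4) * sqrt(r * (n - r) / n + 1)
    return centre - half_width, centre + half_width


low, high = wilson_interval(30, 100)
print(f"[{low:.3f}, {high:.3f}]")
```

For these values the interval is roughly $[0.217, 0.398]$, centred at $\frac{32}{104}\approx0.308$ rather than at $\frac{30}{100}=0.3$.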

Proof.

We start by considering the two binomial distributions $X_\pm\sim B(n,p_\pm)$, which can be approximated by the associated normal distributions $N(\mu_\pm,\sigma_\pm)$ since $n$ is appropriately large. The acceptance region is then approximately symmetric about the mean, so the significance level is split equally between the two tails. By the definition of $p_\pm$, this means that $r$ lies on the boundary of each acceptance region, i.e. $P(X_- > r)=2.5\%$ and $P(X_+ < r)=2.5\%$.

Writing $\Phi$ for the standard normal cumulative distribution function (CDF) and applying its inverse, the first condition gives
\begin{align} &&P(X_->r)=1-P(X_-< r)&=1-\Phi\left(\frac{r-\mu_-}{\sigma_-}\right)= 2.5\%\\ \implies&&\Phi\left(\frac{r-\mu_-}{\sigma_-}\right)&=1-2.5\%=97.5\%\\ \implies&&\frac{r-\mu_-}{\sigma_-}&=\Phi^{-1}(97.5\%)\approx1.96\approx 2 \end{align}
and the second gives
\begin{align} &&P(X_+< r)&=\Phi\left(\frac{r-\mu_+}{\sigma_+}\right)=2.5\%\\ \implies&&\frac{r-\mu_+}{\sigma_+}&=\Phi^{-1}(2.5\%)\approx-1.96\approx-2 \end{align}

Combining the two results and inserting the formulas for the mean and standard deviation of a binomial distribution, $\mu=np$ and $\sigma=\sqrt{np(1-p)}$, yields
\begin{align} &&\frac{r-\mu_\pm}{\sigma_\pm}&\approx\mp2\\ \implies&&\frac{r-np_\pm}{\sqrt{np_\pm(1-p_\pm)}}&\approx\mp2\\ \implies&&r-np_\pm&\approx\mp2\sqrt{np_\pm(1-p_\pm)}\\ \implies&&(r-np)^2&\approx4np(1-p)\\ \implies&&r^2-2npr+n^2p^2&\approx4np-4np^2\\ \implies&&n(n+4)p^2-2n(r+2)p+r^2&\approx0 \end{align}
I dropped the $\pm$ as soon as the formulas became identical, which happens when squaring removes the sign. Now we have a quadratic equation in $p$, so we start by determining the coefficients and the discriminant.
The coefficients are $a=n(n+4)$, $b=-2n(r+2)$ and $c=r^2$, so the discriminant is
\begin{align} \Delta&=b^2-4ac\\ &=4n^2(r+2)^2-4n(n+4)r^2\\ &=4n(\cancel{nr^2}+4nr+4n-\cancel{nr^2}-4r^2)\\ &=16n(nr-r^2+n)\\ &=16n^2\left(\frac{r(n-r)}{n}+1\right) \end{align}
The solutions are therefore
\begin{align} p_\pm&\approx\frac{-b\pm\sqrt{\Delta}}{2a}\\ &\approx\frac{-(-2n(r+2))\pm\sqrt{16n^2\left(\frac{r(n-r)}{n}+1\right) }}{2n(n+4)}\\ &\approx\frac{r+2}{n+4}\pm\frac{4n}{2n(n+4)}\sqrt{\frac{r(n-r)}{n}+1} \end{align}
which simplifies to Wilson's formula by cancelling $2n$ in the second fraction.

∎
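The algebra in the proof can be checked numerically. The sketch below solves the quadratic equation with the coefficients $a$, $b$ and $c$ from the proof and confirms that $r$ lies two standard deviations from the mean at each root. The function names `quadratic_roots` and `z_score` and the example values $r=30$, $n=100$ are assumptions made for illustration.

```python
from math import sqrt


def quadratic_roots(r, n):
    """Roots of n(n+4)p^2 - 2n(r+2)p + r^2 = 0, smaller root first."""
    a = n * (n + 4)
    b = -2 * n * (r + 2)
    c = r**2
    root = sqrt(b * b - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)


def z_score(r, n, p):
    """Distance from r to the mean np, in units of the standard deviation."""
    return (r - n * p) / sqrt(n * p * (1 - p))


r, n = 30, 100
p_minus, p_plus = quadratic_roots(r, n)
# Up to rounding, these should be close to +2 and -2 respectively.
print(z_score(r, n, p_minus), z_score(r, n, p_plus))
```

The smaller root corresponds to $p_-$, where $r$ sits above the mean, and the larger root to $p_+$, where $r$ sits below it.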