This section is aimed at students in upper secondary education in the Danish
school system, so some objects will be simplified and some details will be omitted.
Confidence Intervals
Assume that there is a population of objects of which an unknown
proportion, \(p\in[0,1]\), has a specific attribute and you select a
number, \(n\in\mathbb{N}\), of these objects at random. Then the number of
objects in the sample with this attribute is a binomially distributed
random variable denoted as \(X\sim B(n,p)\). To estimate the unknown
proportion, we observe the number of objects in the sample with the
attribute, \(r\in[0,n]\cap\mathbb{N}\), and then we imagine the lowest and
highest proportions, \(p_-\) and \(p_+\), that would reasonably yield
these results. Here reasonable means that \(r\) lies in the acceptance region
of the associated binomial distribution with significance level \(\alpha\).
Then we can construct the \(1-\alpha\) confidence interval as
\([p_-,p_+]\).
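The construction above can be sketched numerically. The following is a minimal sketch, not a definitive implementation: the helper names `binom_cdf` and `exact_interval` are hypothetical, and it assumes an equal-tailed acceptance region in which the observed value \(r\) itself is counted in each tail (the convention usually attributed to Clopper and Pearson). It searches for \(p_-\) and \(p_+\) by bisection.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ B(n, p); an empty sum (k < 0) gives 0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def exact_interval(r, n, alpha=0.05, tol=1e-9):
    """Search for (p_minus, p_plus) by bisection.

    Assumption: each tail gets alpha/2 and the observed r is included
    in the tail probability.
    """
    half = alpha / 2

    # p_minus: smallest p for which P(X >= r) reaches alpha/2.
    if r == 0:
        p_minus = 0.0  # P(X >= 0) is always 1, so no p is rejected
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            # P(X >= r) increases as p increases
            if 1 - binom_cdf(r - 1, n, mid) < half:
                lo = mid
            else:
                hi = mid
        p_minus = hi

    # p_plus: largest p for which P(X <= r) is still at least alpha/2.
    if r == n:
        p_plus = 1.0  # P(X <= n) is always 1, so no p is rejected
    else:
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            # P(X <= r) decreases as p increases
            if binom_cdf(r, n, mid) < half:
                hi = mid
            else:
                lo = mid
        p_plus = lo

    return p_minus, p_plus


if __name__ == "__main__":
    print(exact_interval(30, 100))
```

The bisection works because each tail probability changes monotonically with \(p\), so there is exactly one crossing point to find for each limit.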
The Wilson 95% Confidence Interval Formula
For a sufficiently large sample we can estimate the 95% confidence
interval by Wilson's formula
$$p_\pm\approx\frac{r+2}{n+4}\pm\frac{2}{n+4}\sqrt{\frac{r(n-r)}{n}+1}$$
This yields an interval symmetric about \(\frac{r+2}{n+4}\), which is not
the intuitive point estimate. However, this interval requires fewer assumptions
than some others, and for large \(n\) and \(r\) its midpoint is approximately
equal to the intuitive \(\frac{r}{n}\).
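As an illustration, the formula can be evaluated directly. This is a small sketch, assuming the rounded value \(\Phi^{-1}(97.5\%)\approx 2\) used in this section; the function name `wilson_interval_95` is hypothetical.

```python
from math import sqrt


def wilson_interval_95(r, n):
    """Approximate 95% interval (p_minus, p_plus) from Wilson's formula.

    Uses the rounded quantile 2 rather than 1.96, matching the text.
    """
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need n > 0 and 0 <= r <= n")
    midpoint = (r + 2) / (n + 4)
    half_width = 2 / (n + 4) * sqrt(r * (n - r) / n + 1)
    return midpoint - half_width, midpoint + half_width


if __name__ == "__main__":
    p_minus, p_plus = wilson_interval_95(30, 100)
    print(f"[{p_minus:.4f}, {p_plus:.4f}]")
```

For \(r=30\) and \(n=100\) the midpoint is \(\frac{32}{104}\approx 0.308\) and the interval is roughly \([0.217, 0.398]\), which contains the intuitive estimate \(\frac{r}{n}=0.3\).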
Proof.
We start by considering the two binomial distributions \(X_\pm\sim
B(n,p_\pm)\), which can be approximated by the associated normal
distributions \(N(\mu_\pm,\sigma_\pm)\) since \(n\) is appropriately
large. The acceptance region is approximately symmetric about the
mean, so the significance level is split equally between the two
tails. By the definition of \(p_\pm\), this means that \(r\) lies on the
boundaries of the acceptance regions, i.e. \(P(X_- > r)=2.5\%\) and
\(P(X_+ < r)=2.5\%\).
\begin{align}
&&P(X_->r)=1-P(X_-< r)&=1-\Phi\left(\frac{r-\mu_-}{\sigma_-}\right)=
2.5\%\\
\implies&&\Phi\left(\frac{r-\mu_-}{\sigma_-}\right)&=1-2.5\%=97.5\%\\
\implies&&\frac{r-\mu_-}{\sigma_-}&=\Phi^{-1}(97.5\%)\approx 2
\end{align}
and
\begin{align}
&&P(X_+< r)&=\Phi\left(\frac{r-\mu_+}{\sigma_+}\right)=2.5\%\\
\implies&&\frac{r-\mu_+}{\sigma_+}&=\Phi^{-1}(2.5\%)\approx-2
\end{align}
Here \(\Phi^{-1}\) is the inverse of the standard normal cumulative
distribution function (CDF), applied to both sides of the equations.
Combining the two cases yields
\begin{align}
&&\frac{r-\mu}{\sigma}&\approx\pm2\\
\implies&&\frac{r-np}{\sqrt{np(1-p)}}&\approx\pm2\\
\implies&&r-np&\approx\pm2\sqrt{np(1-p)}\\
\implies&&(r-np)^2&\approx4np(1-p)\\
\implies&&r^2-2npr+n^2p^2&\approx4np-4np^2\\
\implies&&n(n+4)p^2-2n(r+2)p+r^2&\approx0
\end{align}
We dropped the \(\pm\) subscripts as soon as the formulas became identical,
and we used the mean \(\mu=np\) and standard deviation
\(\sigma=\sqrt{np(1-p)}\) of a binomial distribution. This is a quadratic
equation in \(p\), so we start by determining the coefficients
and the discriminant: \(a=n(n+4)\), \(b=-2n(r+2)\), \(c=r^2\) and
\begin{align}
\Delta&=b^2-4ac\\
&=4n^2(r+2)^2-4n(n+4)r^2\\
&=4n(\cancel{nr^2}+4nr+4n-\cancel{nr^2}-4r^2)\\
&=16n^2\left(\frac{r(n-r)}{n}+1\right)
\end{align}
Now the solutions will be
\begin{align}
p_\pm&\approx\frac{-b\pm\sqrt{\Delta}}{2a}\\
&\approx\frac{-(-2n(r+2))\pm\sqrt{16n^2\left(\frac{r(n-r)}{n}+1\right)
}}{2n(n+4)}\\
&\approx\frac{r+2}{n+4}\pm\frac{4n}{2n(n+4)}\sqrt{\frac{r(n-r)}{n}+1}
\end{align}
which simplifies to Wilson's formula by cancelling \(2n\) in the
second fraction.
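The algebra in the proof can be checked numerically. The sketch below, with hypothetical helper names, solves the quadratic \(n(n+4)p^2-2n(r+2)p+r^2=0\) with the general quadratic formula and compares the roots with the closed form; it is a sanity check on a few sample values, not a proof.

```python
from math import isclose, sqrt


def quadratic_roots(r, n):
    """Roots of n(n+4)p^2 - 2n(r+2)p + r^2 = 0 via the quadratic formula."""
    a = n * (n + 4)
    b = -2 * n * (r + 2)
    c = r * r
    root = sqrt(b * b - 4 * a * c)  # discriminant is positive for 0 <= r <= n
    return (-b - root) / (2 * a), (-b + root) / (2 * a)


def closed_form(r, n):
    """The same two values written as Wilson's formula with quantile 2."""
    midpoint = (r + 2) / (n + 4)
    half_width = 2 / (n + 4) * sqrt(r * (n - r) / n + 1)
    return midpoint - half_width, midpoint + half_width


def forms_agree(r, n):
    """True if both expressions give the same interval up to rounding."""
    q_lo, q_hi = quadratic_roots(r, n)
    c_lo, c_hi = closed_form(r, n)
    return (isclose(q_lo, c_lo, abs_tol=1e-12)
            and isclose(q_hi, c_hi, abs_tol=1e-12))


if __name__ == "__main__":
    for r, n in [(30, 100), (5, 50), (0, 20), (20, 20)]:
        print(r, n, forms_agree(r, n))
```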