
The Ballot Theorem

Suppose in an election with two candidates, we are certain of the final tally. However, as the votes are counted sequentially, the lead may switch between the candidates during the process. This was certainly the case in the 2020 U.S. presidential election, where in several swing states, different types of votes (mail, early-voting, in-person) were counted at different stages. While the final tally does not change in a recount, the lead change may. One may ask: How likely will there be lead changes in a recount? Can the winning candidate always lead from start to finish? The Ballot Theorem provides a quantitative answer to this question.

(Informal) Suppose candidates A and B each have a and b votes, respectively, with a>b. If the ballots are sorted uniformly at random and counted sequentially, the probability that A always leads B in this sequential count is \frac{a-b}{a+b}.

The theorem matches our intuition that if the election is close, i.e., the vote difference a-b is small, then it is very likely that in a recount, B will be winning at some point, even though A will eventually overtake B's lead and win in the end. Let us formalize the statement in terms of random variables.

If X_{1},\ldots,X_{t} are random variables, \overline{X_{t}}:=\frac{1}{t}\sum_{i=1}^{t}X_{i} is the average of X_{1},\ldots,X_{t}.
(Ballot Theorem) Suppose X_{1},\ldots,X_{N}\in\{-1,1\} are i.i.d. random variables and \rho\in\mathbb{R}^{+}. Then

    \[ P\left(\forall t\in[N],\overline{X_{t}}>0|\overline{X_{N}}=\rho\right)=\rho. \]

To see that the formal statement captures the semantics of ballot counting, imagine that all N votes are in a box, and we sample ballots from the box uniformly at random without replacement. At time t, X_{t}=1 if the t-th sample is for Candidate A and X_{t}=-1 if it is for Candidate B. Then \rho=\frac{a-b}{a+b}. In the rest of this post, we will prove the Ballot Theorem using two commonly used techniques in probability: the reflection principle and martingales.
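Before diving into the proofs, the statement is easy to sanity-check numerically. The following sketch (the function names are mine, for illustration) simulates random recounts and estimates the probability that A always leads:

```python
import random

def always_leads(ballots):
    """True iff the running tally stays strictly positive throughout."""
    total = 0
    for v in ballots:
        total += v
        if total <= 0:
            return False
    return True

def estimate_lead_probability(a, b, trials=50_000, seed=0):
    """Estimate P(A always leads) by shuffling a ballots of +1 and b of -1."""
    rng = random.Random(seed)
    ballots = [1] * a + [-1] * b
    hits = 0
    for _ in range(trials):
        rng.shuffle(ballots)
        hits += always_leads(ballots)
    return hits / trials
```

With a=6 and b=4, the estimate hovers around \frac{a-b}{a+b}=0.2, as the theorem predicts.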

Reflection Principle

The Reflection Principle is a common technique in the study of Brownian motion, a continuous stochastic process that can model the movement of particles or the trajectory of stock prices over time. The idea is that if at any time the stock price is equally likely to go up or down, then any upward trajectory of the stock price has a corresponding downward trajectory (“reflected” along the x-axis). This symmetry allows us to reduce certain questions about a sequence of random variables to a single one. For instance, in the Ballot Theorem, the probability that all N random variables \{\overline{X_{t}}\}_{t\in[N]} are positive is reduced to a statement about the single random variable \overline{X_{N}}.

Before writing the proof formally, let us first describe it at a high level. Imagine the ballots are sorted, and we start counting the votes one by one. Suppose the first vote belongs to Candidate B. Since a>b, at some point during counting, B must relinquish its lead and tie with A, say with x votes each. Now pause the counting (temporarily!), and pair up each of the counted votes for A with a unique counted vote for B. With this pairing, we can create another recount by swapping every counted A vote with its paired B vote. Then in this other recount, the first vote belongs to Candidate A, and after 2x votes, A is tied with B. In other words, whenever a recount has B losing its initial lead, there is another recount where A initially loses its lead as well!

Imagine plotting the vote counts. The scenario that B has an initial lead is represented by a curve that is below the x-axis and eventually hits the x-axis. The curve for the imagined recount described above is simply obtained by reflecting the original curve along the x-axis.

The probability that the first vote is for B (in which case B must eventually relinquish this lead, since B does not have enough votes to win in the final tally) is \frac{b}{a+b}. By reflection, the probability that the first vote is for A and A relinquishes this lead is also \frac{b}{a+b}. So the probability that A always leads is 1-\frac{2b}{a+b}=\frac{a-b}{a+b}.
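For small elections, this computation can be confirmed exactly by brute force. The sketch below (illustrative, not part of the proof) enumerates every ordering of the ballots and returns the exact probability that A leads throughout:

```python
from fractions import Fraction
from itertools import permutations

def exact_lead_probability(a, b):
    """Exact P(A always leads), by enumerating all (a+b)! orderings.

    Duplicate orderings of identical ballots appear equally often,
    so the ratio is still exact."""
    wins = total = 0
    for order in permutations([1] * a + [-1] * b):
        total += 1
        running, ok = 0, True
        for v in order:
            running += v
            if running <= 0:
                ok = False
                break
        wins += ok
    return Fraction(wins, total)
```

For instance, exact_lead_probability(4, 2) returns Fraction(1, 3), matching \frac{4-2}{4+2}.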

Let T be the stopping time when \overline{X_{t}} first becomes 0, i.e.,
T=\begin{cases} \inf\{t\in[N]:\overline{X_{t}}=0\} & \text{if the set is non-empty}\\ N & \text{otherwise.} \end{cases}

Note that P\left(\forall t\in[N],\overline{X_{t}}>0|\overline{X_{N}}=\rho\right)=1-P\left(\overline{X_{T}}=0|\overline{X_{N}}=\rho\right), so it suffices to compute P\left(\overline{X_{T}}=0|\overline{X_{N}}=\rho\right). To this end, note that if \overline{X_{T}}=0, by the minimality of T, either \forall t<T,\overline{X_{t}}>0 or \forall t<T,\overline{X_{t}}<0.

Now here is the reflection principle at work.

P\left(\forall t<T,\overline{X_{t}}>0,\overline{X_{T}}=0|\overline{X_{N}}=\rho\right)=P\left(\forall t<T,\overline{X_{t}}<0,\overline{X_{T}}=0|\overline{X_{N}}=\rho\right).
(of Claim) Consider \omega\in\{-1,1\}^{N} such that \forall t<T(\omega),\overline{X_{t}}(\omega)>0, \overline{X_{T}}(\omega)=0, and \overline{X_{N}}(\omega)=\rho, and consider its reflection R(\omega)=(-\omega_{1},\ldots,-\omega_{T},\omega_{T+1},\ldots,\omega_{N}). The map R is a one-to-one correspondence between the two events, and since \overline{X_{T}}(\omega)=0, the reflection flips equally many +1's and -1's, so \omega and R(\omega) have the same probability under the i.i.d. law. Hence the two probabilities are equal.

Again by the minimality of T,

    \[ P(\overline{X_{T}}=0|\overline{X_{N}}=\rho)=2P\left(\forall t<T,\overline{X_{t}}<0,\overline{X_{T}}=0|\overline{X_{N}}=\rho\right). \]

Note that P\left(\forall t<T,\overline{X_{t}}<0,\overline{X_{T}}=0|\overline{X_{N}}=\rho\right)=P\left(X_{1}=-1|\overline{X_{N}}=\rho\right). To see this, if \omega\in\{-1,1\}^{N} is such that \forall t<T,\overline{X_{t}}(\omega)<0,\overline{X_{T}}(\omega)=0,\overline{X_{N}}(\omega)=\rho, then X_{1}(\omega)=-1. Conversely, if X_{1}(\omega)=-1 and \overline{X_{N}}(\omega)=\rho>0, there must be some T such that \forall t<T,\overline{X_{t}}(\omega)<0,\overline{X_{T}}(\omega)=0.

Thus, putting the above together, we have

    \begin{eqnarray*} P\left(\forall t\in[N],\overline{X_{t}}>0|\overline{X_{N}}=\rho\right) & = & 1-P\left(\overline{X_{T}}=0|\overline{X_{N}}=\rho\right)\\  & = & 1-2P\left(\forall t<T,\overline{X_{t}}<0,\overline{X_{T}}=0|\overline{X_{N}}=\rho\right)\\  & = & 1-2P\left(X_{1}=-1|\overline{X_{N}}=\rho\right). \end{eqnarray*}

And finally, since the random variables \{X_{i}\}_{i\in[N]} are i.i.d. (so each X_{i} has the same conditional expectation given \overline{X_{N}}),

    \begin{eqnarray*} 1-2P\left(X_{1}=-1|\overline{X_{N}}=\rho\right) & = & \mathbb{E}\left[X_{1}|\overline{X_{N}}=\rho\right]\\  & = & \mathbb{E}\left[\overline{X_{N}}|\overline{X_{N}}=\rho\right]\\  & = & \rho, \end{eqnarray*}

finishing the proof.

The proof used a subtle property of the stopping time T. When A always leads B, T is defined to be N, and so \overline{X}_{T}>0 by assumption. At first glance, the definition that T=N if A always leads seems arbitrary. However, the semantics of stopping times require that the event \{T=t\} be “observable” at time t. The information that A always leads B is only determined at time N, so T must be defined to be N. In the language of measure theory, \{T=N\}\in\sigma(X_{1},\ldots,X_{N}). See the next section, where stopping times are formally defined in terms of \sigma-algebras.

Martingale

In a betting system, a gambler often formulates a strategy at time t based on previous observations, say the first t-1 card draws. For example, after seeing ten coin tosses land heads in a row, it is tempting to bet big on the 11th toss being tails. A martingale is essentially a fair game: the expected value at time t, given the observations up to time t-1, is the same as the value at time t-1. Continuing the coin example, if the coin tosses are fair, then the expected winnings on the 11th toss are the same as on the previous tosses.

A sequence of integrable random variables \{X_{t}\}_{t\in\mathbb{Z}} is a martingale if \mathbb{E}\left[X_{t+1}|X_{1},\ldots,X_{t}\right]=X_{t}.

At time t, a gambler may have access to other information computed from X_{1},\ldots,X_{t-1}, not just the values of these previous random variables. Thus, we use a larger object, a \sigma-algebra \mathcal{F}_{t}, that semantically captures all the information available up to time t (as opposed to just the event \{X_{1}=x_{1},\ldots,X_{t}=x_{t}\}). More generally,

A sequence of integrable random variables \{X_{t}\}_{t\in\mathbb{Z}} is a martingale with respect to the \sigma-algebras \{\mathcal{F}_{t}\}_{t\in\mathbb{Z}} if

  • \{\mathcal{F}_{t}\}_{t\in\mathbb{Z}} is a filtration, i.e., \forall s<t,\mathcal{F}_{s}\subset\mathcal{F}_{t},
  • X_{t} is \mathcal{F}_{t}-measurable, and
  • \forall s<t,\mathbb{E}\left[X_{t}|\mathcal{F}_{s}\right]=X_{s}.

Semantically, the first criterion says that the information at time t is cumulative: it includes the information available at all previous times. For the second criterion, by definition \forall x\in\mathbb{R}, \{X_{t}\leq x\}\in\mathcal{F}_{t}, so it guarantees that \mathcal{F}_{t} actually captures the information at time t. Lastly, the third criterion enforces that the betting system is fair, as discussed.
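To make the definition concrete, here is a small numerical check (my own illustration, not from the proof) that the partial sums of fair ±1 coin flips satisfy the martingale property: conditioned on the current level s, the next value averages to s.

```python
import random
from collections import defaultdict

def conditional_next_means(t=5, trials=200_000, seed=0):
    """Estimate E[S_{t+1} | S_t = s] for a fair +/-1 random walk.

    The martingale property predicts each conditional mean is (about) s."""
    rng = random.Random(seed)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(t))  # S_t
        s_next = s + rng.choice((-1, 1))                # one more fair step
        sums[s] += s_next
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in counts}
```

Running conditional_next_means() returns a dictionary whose values are each close to their keys, exactly as the third criterion demands.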

    In probability, many problems have a hidden martingale, and finding it often illuminates the problem. Here is roughly how a martingale is embedded inside the Ballot Theorem. Imagine we have randomly sorted N ballots for a recount. At time -t (the negative indices will be explained soon), we have t ballots uncounted. Since we know the final tally and have just counted N-t ballots, we know how many of the remaining t ballots are for Candidate A and how many for Candidate B, just not the specific order in which they will come out. Thus, if we draw a random ballot from the remaining t ballots, its expected value is exactly \overline{X}_{-t}:=\frac{1}{t}\sum_{i=1}^{t}X_{i}, suggesting that \overline{X}_{-t} is a martingale (the precise calculation is within the proof below).

    We make the curious choice to index our sample averages \overline{X}_{-t} with negative indices, even though we could have shifted the entire index sequence to the right by N. However, in general the number of random variables may be infinite (the setting for the Law of Large Numbers). In that case, the martingale stops when the average is taken over just one random variable (or zero), and it is simply more instructive to let X_{-1} or X_{0} be the last random variable in the martingale. In fact, martingales with negative indices are also called backward martingales. Backward martingales are easier to work with, as they have stronger convergence guarantees. The reason is that a backward martingale has a last random variable, and by the martingale property every earlier term is a conditional expectation of it, which controls the entire sequence.

    We do not need the full machinery of backward martingales since in our application N, the number of ballots, is finite. But for pedagogical purposes we will retain the use of negative indices to remind ourselves that \overline{X}_{-t} is actually a backward martingale, with stronger tools available should we need to avail ourselves of them.
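The heuristic above can be verified exactly in miniature: remove one uniformly random ballot from the t uncounted ones, and the expected average of the remaining t-1 equals the current average. A small sketch (illustrative names, exact arithmetic):

```python
from fractions import Fraction

def draw_preserves_average(remaining):
    """For a multiset of +/-1 ballots, check that removing one ballot
    uniformly at random leaves the expected average unchanged
    (the backward-martingale property in miniature)."""
    t = len(remaining)
    total = sum(remaining)
    current_avg = Fraction(total, t)
    # Each ballot v is removed with probability 1/t,
    # leaving average (total - v)/(t - 1).
    expected_avg = sum(Fraction(total - v, t - 1) for v in remaining) / t
    return expected_avg == current_avg
```

This returns True for any ballot multiset, since \sum_{v}\frac{1}{t}\cdot\frac{\text{total}-v}{t-1}=\frac{t\cdot\text{total}-\text{total}}{t(t-1)}=\frac{\text{total}}{t}.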

    Similar to the reflection principle, we will also make use of stopping time, but this time we will use it more extensively, so we define it carefully now.

    A random variable T:\Omega\rightarrow\mathbb{Z} is a stopping time with respect to \{\mathcal{F}_{t}\}_{t\in\mathbb{Z}} if for all t\in\mathbb{Z}, \{T=t\}\in\mathcal{F}_{t}, where \{\mathcal{F}_{t}\}_{t\in\mathbb{Z}} is a filtration.

    Semantically, when \mathcal{F}_{t} represents the cumulative information at time t, any stopping criterion (bet after five heads are tossed, fold when the money is below a threshold, etc.) can only depend on the information available up to time t. Stopping times are often used in conjunction with martingales. It is often useful to stop a sequence of random variables, say at the first time that \overline{X}_{-t} hits 0. If the original sequence forms a martingale, then the sequence stopped at a bounded stopping time T (whose value is itself a random variable) remains a martingale. There are more general statements, but the following suffices for our purpose.

    (Martingale Optional Stopping Time) Let \{Y_{t}\}_{t=0}^{M} be a martingale and T:\Omega\rightarrow\{0,1,\ldots,M\} a stopping time with respect to the filtration \{\mathcal{F}_{t}\}_{t=0}^{M}, where M\in\mathbb{Z}^{+}. Then \mathbb{E}\left[Y_{T}|\mathcal{F}_{0}\right]=Y_{0}.
    By the definition of conditional expectation, it suffices to show that

  • Y_{0} is \mathcal{F}_{0}-measurable and
  • for all A\in\mathcal{F}_{0}, \int_{A}Y_{T}=\int_{A}Y_{0}.

    The first condition is satisfied by definition. For the second, by telescoping, we have

        \[ \int_{A}\left(Y_{T}-Y_{0}\right)=\sum_{t=0}^{M}\int_{A}\left(Y_{\min(T,t+1)}-Y_{\min(T,t)}\right). \]

    For each term, we have

        \begin{eqnarray*} \int_{A}\left(Y_{\min(T,t+1)}-Y_{\min(T,t)}\right) & = & \int_{A\cap\{T\leq t\}}\left(Y_{\min(T,t+1)}-Y_{\min(T,t)}\right)+\\  & & \int_{A\cap\{T>t\}}\left(Y_{\min(T,t+1)}-Y_{\min(T,t)}\right)\\  & = & \int_{A\cap\{T\leq t\}}\left(Y_{T}-Y_{T}\right)+\int_{A\cap\{T>t\}}\left(Y_{t+1}-Y_{t}\right)\\  & = & 0, \end{eqnarray*}

    since A\cap\{T>t\}\in\mathcal{F}_{t} and \{Y_{t}\}_{t=0}^{M} is a martingale with respect to \{\mathcal{F}_{t}\}_{t=0}^{M}.
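The theorem can be confirmed exhaustively for a small walk. The sketch below (my own illustration) enumerates all 2^M equally likely paths of a fair ±1 walk started at y_{0}, stops each path at its first visit to 0 (or at time M), and averages the stopped values; by optional stopping the answer is exactly y_{0}.

```python
from fractions import Fraction
from itertools import product

def expected_stopped_value(y0, M):
    """E[Y_T] for a fair +/-1 walk from y0, where T is the first hit
    of 0, capped at the bounded horizon M."""
    total = Fraction(0)
    for steps in product((-1, 1), repeat=M):
        y = y0
        for s in steps:
            if y == 0:      # stopped: ignore the remaining steps
                break
            y += s
        total += y
    return total / 2**M
```

Since all 2^M step sequences are equally likely, the returned Fraction is the exact expectation, and it equals y0 for every bounded horizon M.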

    (Of Ballot Theorem) For t\in[N], define \overline{X}_{-t}:=\frac{1}{t}\sum_{i=1}^{t}X_{i}, and let \mathcal{F}_{-t} denote the \sigma-algebra \sigma(\overline{X}_{-t},X_{t+1},\ldots,X_{N}). We first show that the sample averages \overline{X}_{-t} form a martingale with respect to the filtration \{\mathcal{F}_{-t}\}.

    (Of Claim) Since the random variables X_{i} are i.i.d., we have

        \begin{eqnarray*} \mathbb{E}\left[\overline{X}_{-(t-1)}|\mathcal{F}_{-t}\right] & = & \frac{1}{t-1}\sum_{i=1}^{t-1}\mathbb{E}\left[X_{i}|\overline{X}_{-t},X_{t+1},\ldots,X_{N}\right]\\  & = & \frac{1}{t-1}\sum_{i=1}^{t-1}\mathbb{E}\left[X_{i}|\overline{X}_{-t}\right]\\  & = & \mathbb{E}\left[X_{1}|\overline{X}_{-t}\right]. \end{eqnarray*}

    On the other hand, by i.i.d. again, we have

        \begin{eqnarray*} \mathbb{E}\left[X_{1}|\overline{X}_{-t}\right] & = & \frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[X_{i}|\overline{X}_{-t}\right]\\  & = & \mathbb{E}\left[\overline{X}_{-t}|\overline{X}_{-t}\right]\\  & = & \overline{X}_{-t}. \end{eqnarray*}

    Now let T be the stopping time when the sample average first becomes 0, i.e.,
    T=\begin{cases} \inf\{t\in\{-N,\ldots,-1\}:\overline{X}_{t}=0\} & \text{if the set is non-empty}\\ -1 & \text{otherwise.} \end{cases}

    In the else-case above, T must be -1 for T to be a stopping time: we need \{T=-1\}\in\mathcal{F}_{-1}, and the event that the average never hits 0 is only determined at time -1. Since the \overline{X}_{-t} form a martingale, by the Martingale Optional Stopping Time Theorem,
    we have \overline{X}_{-N}=\mathbb{E}\left[\overline{X}_{T}|\mathcal{F}_{-N}\right].

    Since \{\overline{X_{-N}}=\rho\}\in\mathcal{F}_{-N}, by the Law of Total Expectation, we have

        \begin{eqnarray*} \mathbb{E}\left[\overline{X_{-N}}|\overline{X_{-N}}=\rho\right] & = & \mathbb{E}\left[\mathbb{E}\left[\overline{X_{T}}|\mathcal{F}_{-N}\right]|\overline{X_{-N}}=\rho\right]\\  & = & \mathbb{E}\left[\overline{X_{T}}|\overline{X_{-N}}=\rho\right]. \end{eqnarray*}

    Evaluating both sides, and writing \mathcal{E}=\{\forall t\in[N], \overline{X_{t}}>0\}, we have

        \begin{eqnarray*} \rho & = & \mathbb{E}\left[\overline{X_{T}}|\overline{X_{-N}}=\rho\right]\\  & = & P(\mathcal{E}|\overline{X_{-N}}=\rho)\mathbb{E}\left[\overline{X_{T}}|\overline{X_{-N}}=\rho,\mathcal{E}\right]+\\  & & P(\mathcal{E}^{c}|\overline{X_{-N}}=\rho)\mathbb{E}\left[\overline{X_{T}}|\overline{X_{-N}}=\rho,\mathcal{E}^{c}\right]\\  & = & P(\mathcal{E}|\overline{X_{-N}}=\rho)\mathbb{E}\left[\overline{X_{T}}|\overline{X_{-N}}=\rho,\mathcal{E}\right]\\  & = & P(\mathcal{E}|\overline{X_{-N}}=\rho)\mathbb{E}\left[\overline{X_{-1}}|\overline{X_{-N}}=\rho,\mathcal{E}\right]\\  & = & P(\mathcal{E}|\overline{X_{-N}}=\rho), \end{eqnarray*}

    where the third, fourth, and fifth equalities follow from \overline{X_{T}}=0 on \mathcal{E}^{c}, the definition of T (on \mathcal{E}, T=-1), and X_{1}=1 on \mathcal{E}, respectively.

    Normal numbers

    Imagine sampling numbers from the unit interval [0,1] and counting the occurrences of each digit in every number’s infinite decimal representation. For instance, in the number 0.0123456789, every nonzero digit appears exactly once, but 0 appears infinitely often (trailing zeros). On the other hand, for the number 0.\overline{0123456789}, the digits after the decimal point occur with equal frequency. As the number of digits examined increases, one may expect the frequencies of the digits \{0,1,\ldots,9\} to be close to one another. In fact, the Normal Number Theorem, first formulated by Borel, not only formalizes this intuition but goes a step further: for almost every number in [0,1], every digit appears with the same limiting frequency. Below we discuss this theorem in detail.

    We first recall a few basic facts about the representation of a number. Rather than working entirely in base 10, we work with an arbitrary base b\geq 2.

    Let b\in\mathbb{N} with b\geq2. We say that x has a base-b representation if there is some M\in\mathbb{N} such that the series \sum_{m=0}^{M}z_{m}b^{m}+\sum_{n=1}^{\infty}x_{n}b^{-n} converges to x, with z_{m},x_{n}\in\mathbb{Z}_{b}.

    In other words, \sum_{m=0}^{M}z_{m}b^{m} is the integer component of x and \sum_{n=1}^{\infty}x_{n}b^{-n} the fractional component. With b=10, our familiar decimal representation, we know that every real number has a decimal representation, and by identifying 0.999\ldots with 1, the representation is unique. Analogously, for every b\geq2, every real number has a base-b representation as well.

    (Normal numbers for base b) Let x\in\mathbb{R} with x=\sum_{m=0}^{M}z_{m}b^{m}+\sum_{n=1}^{\infty}x_{n}b^{-n}. We say that x is normal with respect to base b and digit d if \lim_{N\rightarrow\infty}\frac{1}{N}\#\{n:x_{n}=d,1\leq n\leq N\}=\frac{1}{b}, and x\in\mathbb{R} is normal with respect to base b if for every d\in\mathbb{Z}_{b}, it is normal with respect to base b and digit d.
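To experiment with this definition, one can extract base-b digits and tally empirical frequencies. The following sketch is illustrative (pass a Fraction for exact arithmetic; floating-point inputs are only reliable for a handful of digits):

```python
from fractions import Fraction

def base_b_digits(x, b, n):
    """First n base-b digits of the fractional part of x
    (exact when x is a Fraction)."""
    frac = Fraction(x) - int(Fraction(x))
    digits = []
    for _ in range(n):
        frac *= b
        d = int(frac)   # floor, since frac >= 0
        digits.append(d)
        frac -= d
    return digits

def digit_frequency(x, b, d, n):
    """Fraction of the first n base-b digits of x that equal d."""
    return base_b_digits(x, b, n).count(d) / n
```

For the repeating number 0.\overline{0123456789}=\frac{123456789}{9999999999}, every digit has frequency exactly \frac{1}{10} along block boundaries, matching the intuition from the introduction.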

    Note that in the definition above, z_{m}, the digits in the integer component of x, are completely ignored. The reason is that for every x\in\mathbb{R}, its integer component is bounded, so M in the above definition is fixed and does not affect the limiting value of digit frequencies.

    We say that x\in\mathbb{R} is normal if for every b\geq2, x is normal with respect to b.

    We now state the Normal Number Theorem formally:

    There is a null set E\subset\mathbb{R} such that every x\notin E is normal.
    The theorem is non-constructive: while almost every number is normal, the theorem does not construct specific normal numbers. Furthermore, it is not known whether famous irrational numbers such as \pi and e are normal.
    The power of the theorem lies in aggregation, by 1) non-constructively examining all numbers at once instead of an individual number and 2) considering all digit positions instead of a specific position. For instance, the frequency of the leading digit alone often obeys Benford’s Law, where the digits 1 and 2 in a decimal representation tend to appear more frequently than 8 and 9.

    Let us first consider a truncated version of the theorem. Consider tossing a fair coin independently N times. By a combinatorial argument, most sequences of coin tosses have about N/2 heads and N/2 tails. Interpreting a head as 1 and a tail as 0, most N-bit numbers are approximately normal. The passage from N bits to infinitely many bits is the crux of the theorem, since the combinatorial process of selecting a random N-bit number no longer extends. The proof illustrates how a measure-theoretic abstraction can circumvent this difficulty. In particular, we will apply the following “\epsilon2^{-n} trick” repeatedly.
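The truncated, combinatorial version is easy to compute directly: the fraction of N-bit strings whose number of 1s deviates from N/2 by more than \epsilon N vanishes as N grows. A quick sketch (illustrative):

```python
from math import comb

def fraction_balanced(N, eps):
    """Fraction of N-bit strings with #ones within eps*N of N/2."""
    lo, hi = N / 2 - eps * N, N / 2 + eps * N
    good = sum(comb(N, k) for k in range(N + 1) if lo <= k <= hi)
    return good / 2**N
```

Already fraction_balanced(100, 0.1) exceeds 0.95, and the fraction increases toward 1 as N grows.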

    Let (\Omega,\mathcal{F},P) be any probability space. Let \{E_{n}\}_{n\in\mathbb{N}} be a countable collection of events E_{n}\in\mathcal{F} with each P(E_{n})=0. Then P(\cup_{n\in\mathbb{N}}E_{n})=0.
    Let \epsilon>0. Since P(E_{n})=0, certainly P(E_{n})\leq\epsilon2^{-n}. Then

        \begin{eqnarray*}   P(\cup_{n\in\mathbb{N}}E_{n}) & \leq & \sum_{n=1}^{\infty}P(E_{n})\\                                 & \leq & \sum_{n=1}^{\infty}\epsilon2^{-n}\\                                 & = & \epsilon. \end{eqnarray*}

    Since \epsilon was arbitrary, P(\cup_{n\in\mathbb{N}}E_{n}) must equal 0.

    (Normal Number Theorem)
    First consider the interval [R,R+1], where R\in\mathbb{Z}. Since \mathbb{R} is the countable union of such intervals, by the previous proposition it suffices to show that the set of non-normal numbers in [R,R+1] is null. Now fix b\geq2. By the proposition again, it suffices to show that the set of numbers in [R,R+1] that are not normal with respect to b is null. Now fix d\in\mathbb{Z}_{b}. By the proposition again, it suffices to show that the set of numbers in [R,R+1] that are not normal with respect to base b and digit d is null. By definition, for x\in[R,R+1], x is normal with respect to base b and digit d iff x-R is normal with respect to base b and digit d. Thus, without loss of generality, we may further assume that R=0.

    Now consider the probability space ([0,1],\mathcal{B},\mathbb{P}), where \mathcal{B} is the set of Borel sets on [0,1], and \mathbb{P} the Lebesgue (probability) measure. For n\in\mathbb{N}, define X_{n}:[0,1]\rightarrow\{0,1\} to be X_{n}(x)=\mathbb{I}_{\{x_{n}=d\}}, where x_{n} is the n-th digit of the base b expansion x=\sum_{n=1}^{\infty}x_{n}b^{-n}.

    \{X_{n}\}_{n\in\mathbb{N}} is a sequence of i.i.d. Bernoulli random variables with success probability \frac{1}{b}.
    (of Claim)
    As defined, X_{n} is an indicator function, and the set \{x:X_{n}(x)=1\}=\{x:x_{n}=d\} is just a union of intervals:

        \[ \bigcup_{x_{1},\ldots,x_{n-1}\in\mathbb{Z}_{b}}\left[\sum_{i=1}^{n-1}x_{i}b^{-i}+db^{-n},\sum_{i=1}^{n-1}x_{i}b^{-i}+(d+1)b^{-n}\right), \]

    which is a Borel set, implying that X_{n} is a Bernoulli random variable with \mathbb{P}(X_{n}=1)=b^{n-1}\cdot b^{-n}=\frac{1}{b}, since the union consists of b^{n-1} disjoint intervals, each of length b^{-n}.

    To show independence, it suffices to show that for every N\in\mathbb{N}, the family of random variables \{X_{n}\}_{n=1}^{N} is independent. This follows directly from the observation that any prescribed digits d_{1},\ldots,d_{N}\in\mathbb{Z}_{b} satisfy \mathbb{P}(x_{1}=d_{1},\ldots,x_{N}=d_{N})=b^{-N} (since the interval [\sum_{n=1}^{N}d_{n}b^{-n},\sum_{n=1}^{N}d_{n}b^{-n}+b^{-N}) has length \frac{1}{b^{N}}) and the previous observation that each digit takes a given value with probability \frac{1}{b}.
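This claim can also be observed empirically: sample uniform points in [0,1), peel off base-b digits, and the indicator of each digit position behaves like a Bernoulli(1/b) coin. A sketch (illustrative; floating-point digit extraction is fine for a handful of positions):

```python
import random

def digit_hit_rates(b, d, n_digits, trials=50_000, seed=0):
    """For uniform x in [0,1), estimate P(x_k = d) for k = 1..n_digits.

    Each estimate should be close to 1/b, for every position k."""
    rng = random.Random(seed)
    hits = [0] * n_digits
    for _ in range(trials):
        frac = rng.random()
        for k in range(n_digits):
            frac *= b
            digit = int(frac)
            frac -= digit
            if digit == d:
                hits[k] += 1
    return [h / trials for h in hits]
```

For b=10 and any digit d, all the estimated rates cluster around 0.1, consistent with the claim that the digit indicators are identically distributed across positions.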

    Given our setup, we now apply the Strong Law of Large Numbers to conclude: there exists a null set E such that for every x\notin E, \frac{1}{N}\sum_{n=1}^{N}X_{n}(x) converges to \mathbb{E}\left[X_{1}\right]=\frac{1}{b}, i.e., x is normal with respect to base b and digit d.

    As a corollary, we conclude that the value of every continuous random variable is almost surely a normal number.

    Let (\Omega,\mathcal{F},P) be a probability space, and X:\Omega\rightarrow\mathbb{R} a continuous random variable with a probability density function. Then P(X\in E)=0, where E is the set of non-normal numbers.
    By the Normal Number Theorem, the set E of non-normal numbers is null, so for every n, there is some open set V_{n}\supset E whose Lebesgue measure is at most 1/n. Let f_{X} be the probability density function of X. Then

        \begin{eqnarray*} P(X\in E) & \leq & P(X\in V_{n})\\ & = & \int f_{X}(x)\mathbb{I}_{V_{n}}(x)dx. \end{eqnarray*}

    Replacing V_{n} with \cap_{m\leq n}V_{m}, we may assume the sets V_{n} are decreasing, so \mathbb{I}_{V_{n}} converges pointwise to \mathbb{I}_{V}, where V=\cap_{n}V_{n} is a null set. By the Dominated Convergence Theorem (with dominating function f_{X}),

        \[ \lim_{n\rightarrow\infty}\int f_{X}(x)\mathbb{I}_{V_{n}}(x)dx=\int f_{X}(x)\mathbb{I}_{V}(x)dx=0, \]

    finishing the proof.

    A continuous random variable X having a probability density function is equivalent to the cumulative distribution function F_{X} of X being absolutely continuous. (And recall the Cantor function as an example where X may not even have a probability density function!) Alternatively, the above corollary also follows by observing that E can be approximated by a finite number of small intervals (e.g., via Littlewood’s First Principle, as discussed in a previous blog post), which F_{X}, being absolutely continuous, must map to another null set.