Using natural exponential families: the Chernoff concentration result

In my last post, I introduced a neat little trick: how to define the natural exponential family of a probability distribution p(x). Let’s now see how we can use this trick to prove a powerful concentration result known as the Chernoff (or Cramer-Chernoff) theorem.

The intuition

You should already know the Chebyshev concentration result: a random variable who has finite variance has most of its probability around its mean. The basic idea behind it is that, by bounding some upper-moment of the random variable, we force most of its mass to be tightly packed.

Now, consider a distribution p(x) which has a finite moment generating function M(t) = E( \exp(tx) ). This represents a constraint on a statistic with a very fast growth rate, much faster than x^2, the statistic of the variance. This constraint should give us much tighter bounds on the probability distribution.

The math

Let’s fix E(x) = \mu_0, and let’s seek to bound p(x \geq \mu) for some \mu \geq \mu_0. Since we want to use the cool exponential families, let’s just stick a \exp(tx) term in there and see if we can work with that:

\displaystyle \int p(x) 1(x \geq \mu) dx = \\ M(t) \exp( - t\mu) \int \frac{p(x)\exp(tx)}{M(t)} \exp(-t(x-\mu)) 1(x\geq\mu) dx

Let’s take the terms in order. First the M(t) \exp( - t\mu) term. First, let’s make the cumulant generating function C(t) = \log(M(t)) appear:

\displaystyle M(t) \exp( - t\mu) = \exp( C(t) - t \mu )

To the trained eye, this looks good. Do you remember the dual D(\mu) of C(t) ? He is defined by D(\mu) = - \inf_t ( C(t) - t \mu). We thus have a cool lower bound for this first term:

\displaystyle M(t) \exp( - t\mu) \leq \exp( - D(\mu) )

Furthermore, this lower bound has a very nice link with the natural exponential family: it is reached for the value of t such that the p(x|t) member of the exponential family has mean \mu.

Let’s now work on the second term: \int \frac{p(x)\exp(tx)}{M(t)} \exp(-t(x-\mu)) 1(x\geq\mu) dx. We recognize an expected value under p(x|t) of the statistic \exp(-t(x-\mu)) 1(x\geq\mu). This statistic has values in [0,1] so we can upper-bound it by 1. Combining this with the earlier upper-bound, we get that:

\displaystyle p(x \geq \mu) \leq \exp( - D(\mu) )

This is a pretty nice bound: it tells us that most of the probability mass is in a very tight neighborhood around the mean. But let’s try to go further, and lower-bound p(x \geq \mu).

A lower-bound

The trick to deriving the lower-bound is to first fix the value t so that p(x|t) has mean \mu. Let’s note this value t^\star. We have:

\displaystyle p(x \geq \mu) = \exp( -D(\mu) ) \int p(x|t^\star) \exp( - t^\star (x - \mu)) 1(x \geq \mu) dx

Suppose p(x|t^\star) was extremely concentrated around his mean \mu, then the integral would basically be just p(x \geq \mu|t^\star) which we could roughly estimate to 0.5 if p(x) was also somewhat symmetric. We would then have the following lower-bound:

\displaystyle p(x \geq \mu) \geq 0.5 \exp( - D(\mu) )

Basically, we can make this intuitive argument solid. Noting \theta = - \log( \int p(x|t^\star) \exp( - t^\star (x - \mu)) 1(x \geq \mu) dx ), we can prove that:

\displaystyle \theta \leq p^{-1} t^\star \sqrt(var(x|t)) - \log(p)

where we have used the shorthand p = p(x \geq \mu | t^\star ).

And we can also prove that p is also lower-bounded (in fact, we could this in many different ways: the following bound is super rough but very general). If we note m_4 the fourth moment of p(x| t^\star) then:

\displaystyle p \geq \frac{var(x|t^\star)^2}{4 m_4}

This is a coarse bound, but it does the job: we can compute both the variance and the fourth moment from the derivatives of either M(t^\star) or C(t^\star).

Chernoff’s theorem

We can now finally state Chernoff’s theorem.

Thm: Let X_1, X_2, \dots X_n by IID random variables with mean \mu_0. Note D(\mu) the dual to their common cumulant generating function. Then \forall \mu \geq \mu_0:

\displaystyle n^{-1} \text{ }\log p( n^{-1} \sum_i X_i \geq \mu) \rightarrow - D(\mu)

Proof: The cumulant generating function of the sum C_n(t) respects: C_n(t) = n C(t/n), and t^\star also depends on n via: t_{n}^\star = n t^\star. Using that with the preceding relationships and some tedious math gives us the theorem (and proves that the error vanishes as O(n^{-0.5}) which isn’t the best speed but it’s ok).


We now understand the Chernoff concentration result. The upper-bound is a bit easy and tells us that a random variable which has a moment-generating function is tightly concentrated around its mean which we completely expected. The lower-bound is much more interesting in that it guarantees that large deviations can happen, and tells us exactly with what rate these deviations will happen. It’s very important to not only know this result but to understand exactly how it was derived because both the upper and lower-bound could be potentially be useful in other situations.

Such concentration inequalities are great for statistics, because they enable us to prove that our estimators are guaranteed not to be too wrong. However, I find that I am slightly unsatisfied by this bound in that regard because there are no cases in which the bound is naturally saturated, which I always find a bit troubling.

Two remarks

One thing that I’m a bit puzzled with is whether it’s actually the best to compute the lower-bound at t^\star. It’s not like we’re forced too. Could we possibly find a tighter bound by looking at some slightly different value of t ?

Second remark: did you notice how it would seem that I’ve written down a proof that, for \mu \rightarrow -\infty we would have p(x\geq\mu) \rightarrow 0 when it converges to 1 instead ? Don’t worry: the math is fine. This is simply because I never pointed out where exactly I used that $\mu \geq \mu_0$. Can you figure it out ?


As always, feel free to correct any inaccuracies, errors, spelling mistakes and to send comments on my email ! I’ll be glad to hear from you.