In my last post, I introduced a neat little trick: how to define the natural exponential family of a probability distribution . Let’s now see how we can use this trick to prove a powerful concentration result known as the Chernoff (or Cramer-Chernoff) theorem.
You should already know the Chebyshev concentration result: a random variable who has finite variance has most of its probability around its mean. The basic idea behind it is that, by bounding some upper-moment of the random variable, we force most of its mass to be tightly packed.
Now, consider a distribution which has a finite moment generating function . This represents a constraint on a statistic with a very fast growth rate, much faster than , the statistic of the variance. This constraint should give us much tighter bounds on the probability distribution.
Let’s fix , and let’s seek to bound for some . Since we want to use the cool exponential families, let’s just stick a term in there and see if we can work with that:
Let’s take the terms in order. First the term. First, let’s make the cumulant generating function appear:
To the trained eye, this looks good. Do you remember the dual of ? He is defined by . We thus have a cool lower bound for this first term:
Furthermore, this lower bound has a very nice link with the natural exponential family: it is reached for the value of such that the member of the exponential family has mean .
Let’s now work on the second term: . We recognize an expected value under of the statistic . This statistic has values in so we can upper-bound it by . Combining this with the earlier upper-bound, we get that:
This is a pretty nice bound: it tells us that most of the probability mass is in a very tight neighborhood around the mean. But let’s try to go further, and lower-bound .
The trick to deriving the lower-bound is to first fix the value so that has mean . Let’s note this value . We have:
Suppose was extremely concentrated around his mean , then the integral would basically be just which we could roughly estimate to if was also somewhat symmetric. We would then have the following lower-bound:
Basically, we can make this intuitive argument solid. Noting , we can prove that:
where we have used the shorthand .
And we can also prove that is also lower-bounded (in fact, we could this in many different ways: the following bound is super rough but very general). If we note the fourth moment of then:
This is a coarse bound, but it does the job: we can compute both the variance and the fourth moment from the derivatives of either or .
We can now finally state Chernoff’s theorem.
Thm: Let by IID random variables with mean . Note the dual to their common cumulant generating function. Then :
Proof: The cumulant generating function of the sum respects: , and also depends on via: . Using that with the preceding relationships and some tedious math gives us the theorem (and proves that the error vanishes as which isn’t the best speed but it’s ok).
We now understand the Chernoff concentration result. The upper-bound is a bit easy and tells us that a random variable which has a moment-generating function is tightly concentrated around its mean which we completely expected. The lower-bound is much more interesting in that it guarantees that large deviations can happen, and tells us exactly with what rate these deviations will happen. It’s very important to not only know this result but to understand exactly how it was derived because both the upper and lower-bound could be potentially be useful in other situations.
Such concentration inequalities are great for statistics, because they enable us to prove that our estimators are guaranteed not to be too wrong. However, I find that I am slightly unsatisfied by this bound in that regard because there are no cases in which the bound is naturally saturated, which I always find a bit troubling.
One thing that I’m a bit puzzled with is whether it’s actually the best to compute the lower-bound at . It’s not like we’re forced too. Could we possibly find a tighter bound by looking at some slightly different value of ?
Second remark: did you notice how it would seem that I’ve written down a proof that, for we would have when it converges to 1 instead ? Don’t worry: the math is fine. This is simply because I never pointed out where exactly I used that $\mu \geq \mu_0$. Can you figure it out ?
As always, feel free to correct any inaccuracies, errors, spelling mistakes and to send comments on my email ! I’ll be glad to hear from you.