The natural exponential family of p(x)

Today, I’ll present a neat little trick to construct a family of probability distributions which all look similar to a base distribution p(x) but which all have different means. And no, it isn’t p(x-\mu) !!

Once that’s done, I’ll show you in another post how to use this neat trick to prove a cool concentration result.


Let p(x) be some probability distribution. Let’s assume that p has a Moment Generating Function (MGF): there is some interval around 0 on which

\displaystyle M(t) = \int p(x) \exp( t x) dx

is finite for every t in that interval.

Then, we simply define the natural exponential family of p(x) to be all the probability distributions:

\displaystyle p(x|t) \propto p(x) \exp(t x)

for all values of t such that M(t) is finite (the normalization constant is exactly 1/M(t)). So it’s just p(x), tilted so that its mass shifts around a little bit.

You already know several examples of natural exponential families: if p is a Gaussian, then its natural exponential family is all the other Gaussians with the same variance. If p is a Gamma distribution, its exponential family is also simple: try and see if you can figure it out.
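To check the Gaussian claim numerically, here’s a small sketch (my own illustration, not from any particular library: I pick a standard Gaussian base and integrate on a grid). Tilting p(x) by \exp(tx) should give back a Gaussian whose mean is exactly t:

```python
import numpy as np

# Base density: standard Gaussian p(x).
def p(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# A grid wide enough that the truncated tails are negligible.
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def tilted_mean(t):
    # Unnormalized tilted density p(x) * exp(t x), normalized numerically
    # (the normalization we divide by is a Riemann-sum estimate of M(t)).
    w = p(x) * np.exp(t * x)
    w /= np.sum(w) * dx
    return np.sum(x * w) * dx  # E(x | t)

# For a Gaussian with variance 1, the tilted member should have mean t.
print(tilted_mean(1.5))   # close to 1.5
print(tilted_mean(-0.7))  # close to -0.7
```

The same two functions work for any base density you can evaluate on a grid, so you can use them to guess the answer for the Gamma distribution too.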


First important property: all members of the family have a different mean. I’ll prove this in the next paragraph.

Second property: all “reachable” means are reached. If p(x) has a bounded support, then every value strictly inside the support is the mean of some member of the family. If the support is \mathbb R, then every real value is reached. I don’t know a proof of that though, so take it with a grain of salt.

As a corollary to this, we can parameterize the family not with t but with \mu = E(x|t): this characterizes a single member of the family !

Relation to the cumulant generating function (CGF)

Since p(x) has a MGF, it also has a CGF:

\displaystyle C(t) = \log (M(t))

Normally, when people talk about the derivatives of the MGF and the CGF, they only talk about the derivatives at t=0, but with the exponential family, we can finally talk about the derivatives at other values of t !

The k-th derivative of C(t) at some value of t is, quite simply, the k-th cumulant of p(x|t): one of the members of the exponential family !

Note that it’s not quite as simple for the MGF M(t): its derivatives need to be normalized by M(t) before we obtain an uncentered moment … If this doesn’t speak to you, let’s illustrate it with the second derivative of both: would you rather compute var(x|t) or M(t) E(x^2|t) ?

By looking at C, we can see why each member of the family has a different mean: C is strictly convex (as long as p isn’t a point mass), so its derivative is a bijection onto its range, and that derivative is exactly C'(t) = E(x|t). QED.
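Here’s a quick sanity check of the cumulant claim (my own illustration: I pick an Exponential(1) base, for which M(t) = 1/(1-t) and C(t) = -\log(1-t), and the tilted member p(x|t) is Exponential(1-t)). Finite differences of C at t \neq 0 should match the mean and variance of the tilted member:

```python
import numpy as np

# Base: Exponential(1), so C(t) = -log(1 - t) for t < 1.
C = lambda t: -np.log(1.0 - t)

t, h = 0.3, 1e-5
# Central finite differences for C'(t) and C''(t).
c1 = (C(t + h) - C(t - h)) / (2 * h)
c2 = (C(t + h) - 2 * C(t) + C(t - h)) / h**2

# The tilted member p(x|t) ∝ exp(-x) exp(t x) is Exponential(1 - t):
# mean 1/(1-t) and variance 1/(1-t)^2, i.e. C'(t) and C''(t).
print(c1, 1 / (1 - t))      # both close to 1.4286
print(c2, 1 / (1 - t)**2)   # both close to 2.0408
```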

From the CGF, we can define its Fenchel dual (or convex conjugate) D(\mu), which is sometimes called the large-deviations function (a bad name, if you want my opinion). This dual D is not easy to understand the first time you meet it, but it isn’t that complicated. C is convex, so its first derivative C'(t) = E(x|t) is a bijection. D is also convex, so its first derivative D'(\mu) is also a bijection, and it happens that D' is exactly the inverse of C' !

Once we are equipped with D, we are able to compute which member of the family has a given mean \mu_0: it’s simply the one with t = D'(\mu_0). The value of D(\mu_0) is also interesting in itself: since D(\mu_0) = t \mu_0 - C(t), it gives us the log of the normalization constant, \log M(t) = C(t) = t \mu_0 - D(\mu_0).

This duality between C and D parallels the dual parameterizations with t and \mu ! This is why this is cool. When we want to know the properties of p(x|t), we compute things on C. When we want to learn about p(x|\mu), we compute things on D.
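If you like to see the duality in action, here is a small sketch (again my own illustration, with the Exponential(1) base): computing D(\mu) = \sup_t [t\mu - C(t)] numerically, and checking that the maximizing t is (C')^{-1}(\mu) and that D matches its closed form \mu - 1 - \log\mu:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Base: Exponential(1), CGF C(t) = -log(1 - t), defined for t < 1.
C = lambda t: -np.log(1.0 - t)

# Fenchel dual: D(mu) = sup_t [ t*mu - C(t) ], computed by minimizing C(t) - t*mu.
def D(mu):
    res = minimize_scalar(lambda t: C(t) - t * mu,
                          bounds=(-50.0, 0.999), method="bounded")
    return -res.fun, res.x  # the value D(mu) and the maximizing t

mu0 = 2.5
val, t_star = D(mu0)

# C'(t) = 1/(1-t), so its inverse is t = 1 - 1/mu, i.e. D'(mu0):
print(t_star, 1 - 1 / mu0)          # both close to 0.6
# Closed form of the dual for this family: D(mu) = mu - 1 - log(mu).
print(val, mu0 - 1 - np.log(mu0))   # both close to 0.5837
```

The same `D` routine works for any CGF you can evaluate, which is handy when the conjugate has no closed form.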

Divergence between members of the family

An important question is: how similar are two given members of the exponential family ? If we use the Kullback-Leibler divergence, it’s very easy to answer ! Try to prove that:

\displaystyle KL(t_1,t_2) = C(t_2) - C(t_1) - C'(t_1) (t_2 - t_1)

which is “obviously” positive: a convex function like C always lies above its tangent at t_1, and the right-hand side is exactly C(t_2) minus that tangent evaluated at t_2. This isn’t exactly the simplest expression for the divergence, but if we now compute the symmetrized KL divergence, we get:

\displaystyle KL(t_1,t_2) + KL(t_2,t_1) = [\mu_1-\mu_2][t_1 - t_2]

As far as I know, this simple expression is absolutely useless, but it looks good, doesn’t it ?
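Both formulas are easy to check on the standard Gaussian family (my own illustration: here C(t) = t^2/2, p(x|t) is N(t,1) so \mu = t, and the textbook KL between N(t_1,1) and N(t_2,1) is (t_1-t_2)^2/2):

```python
C = lambda t: t**2 / 2   # CGF of the standard Gaussian
Cp = lambda t: t         # C'(t) = E(x|t) = mu

def kl(t1, t2):
    # The Bregman-divergence form of KL between family members.
    return C(t2) - C(t1) - Cp(t1) * (t2 - t1)

t1, t2 = 0.4, 2.1
# Matches the direct formula KL(N(t1,1) || N(t2,1)) = (t1 - t2)^2 / 2.
print(kl(t1, t2), (t1 - t2)**2 / 2)

# Symmetrized divergence equals (mu1 - mu2)(t1 - t2).
mu1, mu2 = Cp(t1), Cp(t2)
print(kl(t1, t2) + kl(t2, t1), (mu1 - mu2) * (t1 - t2))
```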

Natural exponential families and MGF convergence

Since I’m still quite a bit perplexed by topologies on probability distributions, I can’t resist talking about them in this post.

Recall that MGF convergence is when a sequence of random variables (or of probability distributions) has a corresponding sequence of MGFs which converges pointwise on an interval around 0. This implies weak convergence of the sequence towards the probability distribution with the limit MGF, and also convergence of all moments (among other statistics).

But we can now see this implies quite a bit more still. Consider the sequence of exponential families p_n(x|t) (for t \in [-r,r], the region where the MGFs converge pointwise). For each value of t, we get MGF convergence of p_n(x|t) to the limit member p(x|t). Pretty cool, no ?
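A concrete instance of MGF convergence, chosen by me as the classic example (it isn’t in the discussion above): Binomial(n, \lambda/n) converges to Poisson(\lambda), and you can watch the MGFs converge pointwise:

```python
import numpy as np

lam, t = 2.0, 0.5  # illustrative choices of the Poisson rate and of t

def binom_mgf(n):
    # MGF of Binomial(n, p) is (1 - p + p e^t)^n, here with p = lam / n.
    p = lam / n
    return (1 - p + p * np.exp(t))**n

# MGF of Poisson(lam): exp(lam (e^t - 1)).
poisson_mgf = np.exp(lam * (np.exp(t) - 1))

for n in [10, 100, 10000]:
    print(n, binom_mgf(n))   # approaches the limit as n grows
print("limit:", poisson_mgf)
```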

Is it useful ?

The only real use I know for the natural exponential family is proving a concentration inequality for probability distributions which have an MGF. I’ll write another post about this since this one is getting a bit too long.

I also have some research of my own where this concept plays an important role, but I’m quite a bit behind on writing this down … SOON^{TM}

All in all, the natural exponential family is mostly a cool curiosity, but I hope you can find some use for it in your work.

As always, feel free to correct any inaccuracies, errors, and spelling mistakes, and to send comments by email ! I’ll be glad to hear from you.