Today, I’ll present a neat little trick to construct a family of probability distributions which all look similar to a base distribution but which all have different means. And it isn’t !!
Once that’s done, I’ll show you in another post how to use this neat trick to prove a cool concentration result.
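Let $p$ be some probability distribution. Let’s assume that “$p$ has a Moment Generating Function (MGF)”: there is some interval around 0 such that, for all $\theta$ in that interval:
$$M(\theta) = E_p\left[e^{\theta X}\right] = \int e^{\theta x}\, p(x)\, dx < \infty$$
Then, we simply define the natural exponential family of $p$ to be all the probability distributions:
$$p_\theta(x) = \frac{p(x)\, e^{\theta x}}{M(\theta)}$$
for all values of $\theta$ such that $M(\theta)$ is finite. So it’s just $p$ but shifted around a little bit.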
You already know several examples of natural exponential families: if p is a Gaussian, then its natural exponential family is all the other Gaussians with the same variance. If p is a Gamma distribution, its exponential family is also simple: try and see if you can figure it out.
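To make the Gaussian example concrete, here is the computation written out. The notation is my own choice for this sketch: $\theta$ is the parameter of the family, $M$ is the MGF, and $\sigma^2$ is the variance of a centered Gaussian base distribution.

```latex
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right),
\qquad
M(\theta) = E_p\left[e^{\theta X}\right] = \exp\left(\frac{\sigma^2\theta^2}{2}\right)

p_\theta(x) = \frac{p(x)\, e^{\theta x}}{M(\theta)}
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2} + \theta x - \frac{\sigma^2\theta^2}{2}\right)
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \sigma^2\theta)^2}{2\sigma^2}\right)
```

So the member with parameter $\theta$ is the Gaussian with mean $\sigma^2\theta$ and the same variance $\sigma^2$: the exponential factor only moves the mean.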
First important property: no two members of the family have the same mean. I’ll prove this below, once we have the cumulant generating function.
Second property: all “reachable” means are reached. If $p$ has a bounded support, then every value strictly between the smallest and the largest point of the support corresponds to a member of the family which has that mean. If the support is $\mathbb{R}$, then all values are reached. I don’t know a proof of that though, so take that with a grain of salt.
As a corollary to this, we can parameterize the family not with $\theta$ but with the mean $\mu$: a value of $\mu$ characterizes a single member of the family !
Relation to the cumulant generating function (CGF)
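Since $p$ has an MGF, it also has a CGF:
$$C(\theta) = \log M(\theta) = \log E_p\left[e^{\theta X}\right]$$
Normally, when people talk about the derivatives of the MGF and the CGF, they only talk about the derivatives at $\theta = 0$, but when we talk about the exponential family, we can finally talk about the derivatives at other positions !
The k-th derivative of $C$ at some value of $\theta$ is, quite simply, the k-th cumulant of $p_\theta$: one of the members of the exponential family !
Note that it’s not quite as simple for the MGF $M$: its derivatives need to be normalized by $M(\theta)$ before we obtain an uncentered moment of $p_\theta$ … If this doesn’t speak to you, let’s illustrate it with the second derivative of both: to get the variance of $p_\theta$, would you rather compute $C''(\theta)$ or $\frac{M''(\theta)}{M(\theta)} - \left(\frac{M'(\theta)}{M(\theta)}\right)^2$ ?
By looking at $C$, we can see why each member of the family has a different mean: $C$ is strictly convex (its second derivative is the variance of $p_\theta$, which is positive unless $p$ is a point mass), so its derivative is strictly increasing and therefore a bijection onto its range, and its derivative is:
$$C'(\theta) = E_{p_\theta}[X] = \mu(\theta)$$
QED.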
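If you would rather check these derivative claims numerically than take my word for it, here is a minimal sketch. The base distribution (a fair six-sided die), the function names `cgf`, `tilt` and `mean_var`, and the step size `h` are all my own illustrative choices. It compares finite-difference derivatives of the CGF with the mean and variance of one member of the family.

```python
import math

# Hypothetical base distribution p: a fair six-sided die (chosen for illustration).
xs = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6


def cgf(theta):
    """C(theta) = log E_p[exp(theta * X)]."""
    return math.log(sum(pi * math.exp(theta * x) for x, pi in zip(xs, p)))


def tilt(theta):
    """Probabilities of the family member p_theta(x) = p(x) exp(theta x - C(theta))."""
    c = cgf(theta)
    return [pi * math.exp(theta * x - c) for x, pi in zip(xs, p)]


def mean_var(q):
    """Mean and variance of a distribution q on the points xs."""
    m = sum(qi * x for x, qi in zip(xs, q))
    v = sum(qi * (x - m) ** 2 for x, qi in zip(xs, q))
    return m, v


theta, h = 0.3, 1e-4
# Central finite differences for C'(theta) and C''(theta).
c1 = (cgf(theta + h) - cgf(theta - h)) / (2 * h)
c2 = (cgf(theta + h) - 2 * cgf(theta) + cgf(theta - h)) / h ** 2
m, v = mean_var(tilt(theta))

print("C'(theta) vs mean of p_theta:     ", c1, m)
print("C''(theta) vs variance of p_theta:", c2, v)
```

The two numbers on each printed line should agree up to finite-difference error. Any other discrete base distribution would work just as well here.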
From the CGF, we can define its Fenchel dual or its convex conjugate $D$, which is sometimes called the large-deviations function (which is a bad name if you want my opinion):
$$D(\mu) = \sup_\theta \left(\theta\mu - C(\theta)\right)$$
This dual $D$ is not easy to understand the first time you meet it, but it isn’t that complicated. $C$ is convex, so that its first derivative $C'$ is a bijection. $D$ is also convex so its first derivative $D'$ is also a bijection, and it happens to be that $D'$ is exactly the inverse of $C'$ !
Once we are equipped with $D'$, we are able to compute which member of the family has a given mean $\mu$: it’s simply the guy who has $\theta = D'(\mu)$. The value of $D(\mu)$ is also interesting in itself, because it is equal to $\theta\mu - C(\theta)$ at that $\theta$: it gives us the value of the normalization $C(\theta)$.
This duality between $C$ and $D$ is the parallel of the dual parameterizations with $\theta$ and $\mu$ ! This is why this is cool. When we want to know the properties of $\theta$, we compute things on $C$. When we want to learn about $\mu$, we compute things on $D$.
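Here is a small numerical sketch of this duality. The base distribution (a fair die), the function names and the bisection bounds are all my own illustrative choices, not anything canonical. The sketch inverts the derivative of the CGF by bisection to find which parameter gives a target mean, then evaluates the dual at that mean.

```python
import math

# Hypothetical base distribution p: a fair six-sided die (chosen for illustration).
xs = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6


def cgf(theta):
    """C(theta) = log E_p[exp(theta * X)]."""
    return math.log(sum(pi * math.exp(theta * x) for x, pi in zip(xs, p)))


def mean_of_tilt(theta):
    """C'(theta): the mean of the family member with parameter theta."""
    w = [pi * math.exp(theta * x) for x, pi in zip(xs, p)]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)


def theta_for_mean(mu, lo=-50.0, hi=50.0):
    """D'(mu): invert the increasing map theta -> C'(theta) by bisection.

    The bounds lo and hi are assumptions that suit this die example.
    """
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_of_tilt(mid) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def dual(mu):
    """D(mu) = theta * mu - C(theta), evaluated at theta = D'(mu)."""
    theta = theta_for_mean(mu)
    return theta * mu - cgf(theta)


mu = 4.5
theta = theta_for_mean(mu)
h = 1e-5
# Finite-difference derivative of D, to compare with theta = D'(mu).
d_prime = (dual(mu + h) - dual(mu - h)) / (2 * h)

print("theta with mean 4.5:", theta)
print("mean of that member:", mean_of_tilt(theta))
print("finite-difference D'(mu):", d_prime)
print("D(3.5), the mean of the base die:", dual(3.5))
```

The first and third printed values should agree, which is the statement that the derivative of the dual inverts the derivative of the CGF. The dual should also vanish at the mean of the base distribution, since that mean corresponds to a zero parameter.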
Divergence between members of the family
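An important question is: how similar are two given members $p_{\theta_1}$ and $p_{\theta_2}$ of the exponential family ? If we use the Kullback-Leibler divergence, it’s very easy to answer ! Try to prove that:
$$KL(p_{\theta_1} \| p_{\theta_2}) = C(\theta_2) - C(\theta_1) - (\theta_2 - \theta_1)\, C'(\theta_1)$$
which is “obviously” positive: a convex function like $C$ is always above its tangent at $\theta_1$. This isn’t exactly the simplest expression for the divergence, but if we now compute the symmetrized KL divergence, we get:
$$KL(p_{\theta_1} \| p_{\theta_2}) + KL(p_{\theta_2} \| p_{\theta_1}) = (\theta_2 - \theta_1)\left(C'(\theta_2) - C'(\theta_1)\right) = (\theta_2 - \theta_1)(\mu_2 - \mu_1)$$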
As far as I know, this simple expression is absolutely useless, but it looks good, doesn’t it ?
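If you want a sanity check before attempting the proof, here is a minimal numerical sketch. Again the base distribution (a fair die), the two parameter values and the helper names are my own illustrative choices. It computes the KL divergence directly from the probabilities and compares it with the CGF-based expression, using the fact that the derivative of the CGF is the mean of the corresponding member.

```python
import math

# Hypothetical base distribution p: a fair six-sided die (chosen for illustration).
xs = [1, 2, 3, 4, 5, 6]
p = [1 / 6] * 6


def cgf(theta):
    """C(theta) = log E_p[exp(theta * X)]."""
    return math.log(sum(pi * math.exp(theta * x) for x, pi in zip(xs, p)))


def tilt(theta):
    """Probabilities of the family member with parameter theta."""
    c = cgf(theta)
    return [pi * math.exp(theta * x - c) for x, pi in zip(xs, p)]


def mean(q):
    """Mean of a distribution q on the points xs."""
    return sum(qi * x for x, qi in zip(xs, q))


def kl(q, r):
    """Kullback-Leibler divergence KL(q || r), computed directly."""
    return sum(qi * math.log(qi / ri) for qi, ri in zip(q, r))


t1, t2 = -0.4, 0.7  # two arbitrary parameter values
q1, q2 = tilt(t1), tilt(t2)
mu1, mu2 = mean(q1), mean(q2)  # these are C'(t1) and C'(t2)

kl_direct = kl(q1, q2)
kl_formula = cgf(t2) - cgf(t1) - (t2 - t1) * mu1

sym_direct = kl(q1, q2) + kl(q2, q1)
sym_formula = (t2 - t1) * (mu2 - mu1)

print("KL direct vs formula:         ", kl_direct, kl_formula)
print("symmetrized direct vs formula:", sym_direct, sym_formula)
```

Both pairs should match up to floating-point rounding. Note that the symmetrized version here is the plain sum of the two divergences, with no factor of one half.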
Natural exponential families and MGF convergence
Since I’m still quite a bit perplexed by topologies on probability distributions, I can’t resist talking about them in this post.
Recall that MGF convergence is when a sequence of random variables (or of probability distributions $p_n$) has a corresponding sequence of MGFs $M_n$ which converges pointwise on an interval around 0. This implies weak convergence of the sequence towards the probability distribution $p$ with the limit MGF $M$, and also convergence of all moments (among other statistics).
But we can now see this implies quite a bit more still. Consider the sequence of exponential families $p_{n,\theta}$ (for the region where the MGFs converge pointwise). The MGF of $p_{n,\theta}$ at $t$ is $M_n(\theta + t) / M_n(\theta)$, which converges to $M(\theta + t) / M(\theta)$. So for each value of $\theta$ inside that region, we get MGF convergence of $p_{n,\theta}$ to the limit $p_\theta$. Pretty cool, no ?
Is it useful ?
The only real use I know for the natural exponential family is in proving a concentration inequality for probability distributions which have an MGF. I’ll write another post about this since this one is getting a bit too long.
I also have some research of my own where this concept plays an important role, but I’m quite a bit behind on writing this down …
All in all, the natural exponential family is mostly a cool curiosity, but I hope you can find some use for it in your work.
As always, feel free to correct any inaccuracies, errors, spelling mistakes and to send comments on my email ! I’ll be glad to hear from you.