# The natural exponential family of p(x)

Today, I’ll present a neat little trick to construct a family of probability distributions which all look similar to a base distribution $p(x)$ but which all have different means. And it isn’t $p(x-\mu)$!!

Once that’s done, I’ll show you in another post how to use this neat trick to prove a cool concentration result.

## Definition

Let $p(x)$ be some probability distribution. Let’s assume that “p has a Moment Generating Function (MGF)”: there is some interval around 0 such that:

$\displaystyle M(t) = \int p(x) \exp( t x) dx$

is finite.

Then, we simply define the natural exponential family of $p(x)$ to be all the probability distributions:

$\displaystyle p(x|t) \propto p(x) \exp(t x)$

for all values of $t$ such that $M(t)$ is finite. So it’s just $p(x)$, tilted by an exponential factor so that its mass shifts around a little bit.

You already know several examples of natural exponential families: if p is a Gaussian, then its natural exponential family is all the other Gaussians with the same variance. If p is a Gamma distribution, its exponential family is also simple: try and see if you can figure it out.
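Here is a small numerical sanity check of the Gaussian case (a sketch using numpy, with grid normalization standing in for the exact integral): tilting a standard Gaussian by $\exp(tx)$ gives back a Gaussian with mean $t$ and the same variance.

```python
import numpy as np

# Tilt a standard Gaussian: p(x|t) is proportional to N(x; 0, 1) * exp(t*x).
# Analytically, this should be N(x; t, 1): same shape, mean shifted to t.
t = 1.7
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

base = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density
tilted = base * np.exp(t * x)
tilted /= tilted.sum() * dx                      # normalize numerically

shifted = np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)  # N(t, 1) density

print(np.max(np.abs(tilted - shifted)))  # tiny numerical error
```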

### Properties

First important property: all members of the family have a different mean. I’ll prove this in the next paragraph.

Second property: all “reachable” means are reached. If $p(x)$ has a bounded support, then all the values inside the support correspond to a member of the family which has that mean. If the support is $\mathbb R$, then all values are reached. I don’t know a proof of that though, so take that with a grain of salt.

As a corollary to this, we can parameterize the family not with $t$ but with $\mu = E(x|t)$: each value of $\mu$ characterizes a single member of the family!

### Relation to the cumulant generating function (CGF)

Since $p(x)$ has a MGF, it also has a CGF:

$\displaystyle C(t) = \log (M(t))$

Normally, when people talk about the derivatives of the MGF and the CGF, they only talk about the derivatives at $t=0$, but with the exponential family, we can finally talk about the derivatives at other positions!

The k-th derivative of $C(t)$ at some value of $t$ is, quite simply, the k-th cumulant of $p(x|t)$: one of the members of the exponential family!

Note that it’s not quite as simple for the MGF $M(t)$: its derivatives need to be normalized by $M(t)$ before we obtain an uncentered moment … If this doesn’t speak to you, let’s illustrate it with the second derivative of both: would you rather compute $var(x|t)$ or $M(t) E(x^2|t)$?

By looking at C, we can see why each member of the family has a different mean: $C$ is strictly convex (its second derivative is $var(x|t) > 0$), so its derivative is a bijection, and that derivative is exactly $C'(t) = E(x|t)$. QED.
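To see this numerically, here is a small sketch using the exponential distribution $p(x) = e^{-x}$ as the base, whose CGF is $C(t) = -\log(1-t)$: finite differences of $C$ recover the mean and variance of the tilted distribution.

```python
import numpy as np

# Base: exponential distribution p(x) = exp(-x) for x >= 0.
# Its CGF is C(t) = -log(1 - t), finite for t < 1.
def C(t):
    return -np.log(1 - t)

t, h = 0.3, 1e-4
# Finite-difference derivatives of C at t
C1 = (C(t + h) - C(t - h)) / (2 * h)          # approximates E(x|t)
C2 = (C(t + h) - 2 * C(t) + C(t - h)) / h**2  # approximates var(x|t)

# The tilted distribution p(x|t) ~ exp(-x) * exp(t*x) is Exp(rate 1-t):
# mean 1/(1-t), variance 1/(1-t)**2.
print(C1, 1 / (1 - t))       # both close to 1.4285...
print(C2, 1 / (1 - t)**2)    # both close to 2.0408...
```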

From the CGF, we can define its Fenchel dual, or convex conjugate, $D(\mu)$, which is sometimes called the large-deviations function (a bad name, if you want my opinion). This dual D is not easy to understand the first time you meet him, but he isn’t that complicated. $C$ is convex, so its first derivative $C'(t) = E(x|t)$ is a bijection. $D$ is also convex, so its first derivative $D'(\mu)$ is also a bijection, and it happens that $D'$ is exactly the inverse of $C'$!

Once we are equipped with $D$, we are able to compute which member of the family has a given mean $\mu_0$: it’s simply the one with $t = D'(\mu_0)$. The value of $D(\mu)$ is also interesting in itself: since $D(\mu) = \mu t - C(t)$ at $t = D'(\mu)$, it gives us back the log-normalization $C(t) = \mu t - D(\mu)$, and hence the normalization $M(t) = e^{C(t)}$.
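Sticking with the exponential-distribution base as an illustration (a sketch, assuming $C(t) = -\log(1-t)$, for which the conjugate works out to $D(\mu) = \mu - 1 - \log\mu$), we can check both that $D'$ inverts $C'$ and that the conjugate definition gives back this closed form.

```python
import numpy as np

# Exponential base: C(t) = -log(1 - t) for t < 1, so C'(t) = 1/(1-t).
# Its convex conjugate works out to D(mu) = mu - 1 - log(mu), D'(mu) = 1 - 1/mu.
def C_prime(t):
    return 1 / (1 - t)

def D_prime(mu):
    return 1 - 1 / mu

t = 0.4
mu = C_prime(t)        # mean of the tilted distribution, here 1/0.6
print(D_prime(mu))     # recovers t = 0.4: D' is the inverse of C'

# D(mu) itself, via the conjugate definition sup over t of (mu*t - C(t)):
ts = np.linspace(-5, 0.999, 200001)
D_numeric = np.max(mu * ts + np.log(1 - ts))
print(D_numeric, mu - 1 - np.log(mu))   # agree up to grid resolution
```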

This duality between C and D parallels the dual parameterizations with $t$ and $\mu$! This is why this is cool. When we want to know the properties of $p(x|t)$, we compute things on C. When we want to learn about $p(x|\mu)$, we compute things on D.

### Divergence between members of the family

An important question is: how similar are two given members of the exponential family? If we use the Kullback-Leibler divergence, it’s very easy to answer! Try to prove that:

$\displaystyle KL(t_1,t_2) = C(t_2) - C(t_1) - C'(t_1) (t_2 - t_1)$

which is “obviously” positive: a convex function like C always lies above its tangent at $t_1$, in particular at the point $t_2$. This isn’t exactly the simplest expression for the divergence, but if we now compute the symmetrized KL divergence, we get:

$\displaystyle KL(t_1,t_2) + KL(t_2,t_1) = [\mu_1-\mu_2][t_1 - t_2]$
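A quick sanity check of both identities, sketched for the Gaussian case where everything is computable by hand ($C(t) = t^2/2$ and $\mu = t$):

```python
import numpy as np

# For a standard Gaussian base, p(x|t) = N(t, 1) and C(t) = t**2 / 2.
def C(t):
    return t**2 / 2

def kl(t1, t2):
    # KL(t1, t2) = C(t2) - C(t1) - C'(t1) * (t2 - t1), with C'(t) = t here
    return C(t2) - C(t1) - t1 * (t2 - t1)

t1, t2 = 0.8, -1.3
mu1, mu2 = t1, t2   # for this family, E(x|t) = C'(t) = t
sym = kl(t1, t2) + kl(t2, t1)
print(sym, (mu1 - mu2) * (t1 - t2))   # both equal (t1 - t2)**2
```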

As far as I know, this simple expression is absolutely useless, but it looks good, doesn’t it?

### Natural exponential families and MGF convergence

Since I’m still quite a bit perplexed by topologies on probability distributions, I can’t resist talking about them in this post.

Recall that MGF convergence is when a sequence of random variables (or of probability distributions) has a corresponding sequence of MGFs which converges pointwise. This implies weak convergence of the sequence towards the probability distribution with the limit MGF, and also convergence of all moments (among other statistics).

But we can now see that this implies quite a bit more still. Consider the sequence of exponential families $p_n(x|t)$ (for $t \in [-r,r]$, the region where the MGFs converge pointwise). For each value of $t$, we get MGF convergence of $p_n(x|t)$ to the limit $p(x|t)$. Pretty cool, no?
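As a concrete instance of pointwise MGF convergence, here’s a sketch of the classical Binomial-to-Poisson limit: if $X_n \sim \mathrm{Binomial}(n, \lambda/n)$, its MGF converges pointwise to the Poisson($\lambda$) MGF.

```python
import numpy as np

# X_n ~ Binomial(n, lam/n) converges to Poisson(lam).
# The MGFs converge pointwise: (1 - p + p*e^t)^n -> exp(lam * (e^t - 1)).
lam, t = 2.0, 0.5

def mgf_binomial(n, p, t):
    return (1 - p + p * np.exp(t))**n

limit = np.exp(lam * (np.exp(t) - 1))   # Poisson(lam) MGF at t
for n in [10, 100, 10000]:
    print(n, mgf_binomial(n, lam / n, t), limit)
```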

## Is it useful?

The only real use I know for the natural exponential family is in proving a concentration inequality for probability distributions which have an MGF. I’ll write another post about this since this one is getting a bit too long.

I also have some research of my own where this concept plays an important role, but I’m quite a bit behind on writing this down … $SOON^{TM}$

All in all, the natural exponential family is mostly a cool curiosity, but I hope you can find some use for it in your work.

As always, feel free to correct any inaccuracies, errors, and spelling mistakes, and to send comments to me by email! I’ll be glad to hear from you.