The natural exponential family of p(x)

Today, I’ll present a neat little trick to construct a family of probability distributions which all look similar to a base distribution p(x) but which all have different means. And no, it isn’t p(x-\mu) !!

Once that’s done, I’ll show you in another post how to use this neat trick to prove a cool concentration result.

Definition

Let p(x) be some probability distribution. Let’s assume that “p has a Moment Generating Function (MGF)”: there is some interval around 0 such that:

\displaystyle M(t) = \int p(x) \exp( t x) dx

is finite for every t in that interval.

Then, we simply define the natural exponential family of p(x) to be all the probability distributions:

\displaystyle p(x|t) \propto p(x) \exp(t x)

for all values of t such that M(t) is finite. So it’s just p(x), reweighted by an exponential factor that drags its mass (and hence its mean) around a little bit.
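
To make this concrete, here is a minimal numerical sketch (my own illustration, not part of the definition): discretize a base density on a grid, tilt it by \exp(t x), renormalize, and look at the means you get. The choice of a standard Gaussian base is just an assumption for the example.

import numpy as np

# Minimal sketch: build members of the natural exponential family of a base
# density numerically on a grid. The base here is a standard Gaussian (chosen
# only for illustration); any density with an MGF would do.
x = np.linspace(-10, 10, 20001)
base = np.exp(-x**2 / 2)                 # un-normalized base density p(x)

def tilted_mean(t):
    # Mean of p(x|t) \propto p(x) exp(t x), obtained by renormalizing on the grid.
    w = base * np.exp(t * x)
    w = w / np.trapz(w, x)               # the normalizer is M(t), up to the base's own constant
    return np.trapz(x * w, x)

for t in [-1.0, 0.0, 0.5, 2.0]:
    print(t, tilted_mean(t))             # for a Gaussian base, the mean comes out ~ t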

You already know several examples of natural exponential families: if p is a Gaussian, then its natural exponential family is all the other Gaussians with the same variance. If p is a Gamma distribution, its exponential family is also simple: try and see if you can figure it out.

Properties

First important property: all members of the family have a different mean. I’ll prove this below, when we get to the cumulant generating function.

Second property: all “reachable” means are reached. If p(x) has bounded support, then every value inside the support corresponds to a member of the family with that mean. If the support is \mathbb R, then every value is reached. I don’t know a proof of that though, so take it with a grain of salt.

As a corollary to this, we can parameterize the family not with t but with \mu = E(x|t): this characterizes a single member of the family !

Relation to the cumulant generating function (CGF)

Since p(x) has a MGF, it also has a CGF:

\displaystyle C(t) = \log (M(t))

Normally, when people talk about the derivatives of the MGF and the CGF, they only talk about the derivatives at t=0, but when we talk about the exponential family, we can finally talk about the derivatives at other positions !

The k-th derivative of C(t) at some value of t is, quite simply, the k-th cumulant of p(x|t): one of the members of the exponential family !

Note that it’s not quite as simple for the MGF M(t): its derivatives need to be normalized by M(t) before we obtain an uncentered moment … If this doesn’t speak to you, let’s illustrate it with the second derivative of both: would you rather compute var(x|t) or M(t) E(x^2|t) ?

By looking at C, we can see why each member of the family has a different mean: C is strictly convex (as long as p isn’t a point mass), so its derivative is increasing, hence injective, and that derivative is C'(t) = E(x|t). QED.
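
As a quick sanity check, take p to be a standard Gaussian, so that M(t) = \exp(t^2/2) and

\displaystyle C(t) = \frac{t^2}{2}, \qquad C'(t) = t = E(x|t)

which matches the earlier remark: the family is the set of Gaussians with variance 1, and the member with parameter t has mean t.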

From the CGF, we can define its Fenchel dual, or convex conjugate, D(\mu), which is sometimes called the large-deviations function (which is a bad name if you want my opinion). This dual D is not easy to understand the first time you meet him, but he isn’t that complicated. C is convex, so its first derivative C'(t) = E(x|t) is increasing; D is also convex, so its first derivative D'(\mu) is increasing too, and it happens that D' is exactly the inverse of C' !

Once we are equipped with D, we can compute which member of the family has a given mean \mu_0: it’s simply the one with t = D'(\mu_0). The value of D(\mu) is also interesting in itself: since D(\mu) = \mu t - C(t) at that optimal t = D'(\mu), it hands us the log-normalization of that member, \log M(t) = C(t) = \mu D'(\mu) - D(\mu).
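
Here is a small numerical sketch of that duality (again just an illustration, with a standard Gaussian base assumed, so that C(t) = t^2/2): compute D as a brute-force Legendre transform on a grid, and check that D' sends a target mean back to the right t.

import numpy as np

# Sketch of the C / D duality for a standard Gaussian base.
t_grid = np.linspace(-5, 5, 10001)
C = t_grid**2 / 2                        # CGF of the standard Gaussian on the grid

def D(mu):
    # Convex conjugate D(mu) = sup_t (mu * t - C(t)), approximated on the grid.
    return np.max(mu * t_grid - C)

eps = 1e-3
for mu in [-1.0, 0.0, 0.5, 2.0]:
    d_prime = (D(mu + eps) - D(mu - eps)) / (2 * eps)
    print(mu, d_prime)                   # ~ mu: D'(mu) recovers the t with E(x|t) = mu (here t = mu)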

This duality between C and D parallels the dual parameterizations with t and \mu ! This is why it’s cool. When we want to know the properties of p(x|t), we compute things on C. When we want to learn about p(x|\mu), we compute things on D.

Divergence between members of the family

An important question is: how similar are two given members of the exponential family ? If we use the Kullback-Leibler divergence, it’s very easy to answer ! Try to prove that:

\displaystyle KL(t_1,t_2) = C(t_2) - C(t_1) - C'(t_1) (t_2 - t_1)

which is “obviously” positive: a convex function like C always lies above its tangent drawn at t_1. This isn’t exactly the simplest expression for the divergence, but if we now compute the symmetrized KL divergence, we get:

\displaystyle KL(t_1,t_2) + KL(t_2,t_1) = [\mu_1-\mu_2][t_1 - t_2]

As far as I know, this simple expression is absolutely useless, but it looks good, doesn’t it ?
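
If you’d rather check those two formulas numerically than prove them, here is a small sketch (standard Gaussian base assumed, and the KL integral done on a grid):

import numpy as np

# Numerical check of KL(t1,t2) = C(t2) - C(t1) - C'(t1)(t2 - t1) and of the
# symmetrized formula, for a standard Gaussian base (so C(t) = t^2/2, C'(t) = t).
x = np.linspace(-15, 15, 40001)
base = np.exp(-x**2 / 2)

def member(t):
    # Density of p(x|t), normalized on the grid.
    w = base * np.exp(t * x)
    return w / np.trapz(w, x)

def kl(t1, t2):
    p, q = member(t1), member(t2)
    return np.trapz(p * np.log(p / q), x)

C = lambda t: t**2 / 2
t1, t2 = 0.3, 1.7
mu1 = np.trapz(x * member(t1), x)
mu2 = np.trapz(x * member(t2), x)
print(kl(t1, t2), C(t2) - C(t1) - t1 * (t2 - t1))        # the two should agree
print(kl(t1, t2) + kl(t2, t1), (mu1 - mu2) * (t1 - t2))  # and so should these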

Natural exponential families and MGF convergence

Since I’m still quite a bit perplexed by topologies on probability distributions, I can’t resist talking about them in this post too.

Recall that MGF convergence is when a sequence of random variables (or of probability distributions) has a corresponding sequence of MGFs which converges pointwise on a neighborhood of 0. This implies weak convergence of the sequence towards the probability distribution with the limit MGF, and also convergence of all moments (among other statistics).

But we can now see that this implies quite a bit more still. Consider the sequence of natural exponential families p_n(x|t) (for t \in [-r,r], the region where the MGFs converge pointwise). For each such t, we get MGF convergence of p_n(x|t) to the corresponding member p(x|t) of the limit’s family. Pretty cool, no ?

Is it useful ?

The only real use I know for the natural exponential family is in proving a concentration inequality for probability distributions which have an MGF. I’ll write another post about this since this one is getting a bit too long.

I also have some research of my own where this concept plays an important role, but I’m quite a bit behind on writing this down … SOON^{TM}

All in all, the natural exponential family is mostly a cool curiosity, but I hope you can find some use for it in your work.


As always, feel free to correct any inaccuracies, errors or spelling mistakes, and to send me comments by email ! I’ll be glad to hear from you.


Struggles with probability topology: strengthening the weak-topology

Today, I continue my series on topologies on probability distributions. This series started when I realized that the common topologies (the weak and total-variation topologies) are extremely weak, in that they don’t imply convergence of moments.

In this series, I mostly looked at alternative topologies: the KL, the MGF, the Wasserstein, etc. Today, I will instead focus on how to strengthen the weak topology: adding conditions that ensure that the limit is better behaved than it would be under the weak topology alone.

The weak-topology

Let’s first recall how the weak topology works. A sequence of random variables X_n converges to X iff the expected value of any bounded continuous statistic f(x) converges:

\displaystyle E( f(X_n) ) \rightarrow E( f(X) )

That’s the basic definition, but the most useful characterization is that weak convergence is equivalent to convergence of the expected values of all statistics of the form \exp(itx) for t \in \mathbb{R} (which you will recognize as the Fourier basis).

The weak topology is the most usual topology, but it has what I consider to be confusing behavior. Consider the following example:

X_n is a mixture distribution: with probability \frac{n-1}{n}, pick a value from a Gaussian centered at 0 with variance 1; with probability \frac{1}{n}, pick instead from a Gaussian with mean n (say, with variance 1). The mean of X_n is always 1, and its variance grows to infinity with n. However, X_n converges weakly to a Gaussian centered at 0. Faced with a situation like this, I’d rather say that X_n \rightarrow Y where Y is a degenerate pseudo-Gaussian with mean 1 and infinite variance, OR refuse to acknowledge that X_n converges at all.
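
A quick simulation makes the discomfort tangible (a sketch, with the mixture written exactly as above and unit variance assumed for the contaminating Gaussian):

import numpy as np

rng = np.random.default_rng(0)

def sample_X_n(n, size):
    # The mixture above: N(0, 1) with probability (n-1)/n, N(n, 1) with probability 1/n.
    contaminated = rng.random(size) < 1.0 / n
    return np.where(contaminated, rng.normal(n, 1, size), rng.normal(0, 1, size))

for n in [10, 100, 1000, 10000]:
    x = sample_X_n(n, 2_000_000)
    # The sample mean stays near 1 and the sample variance keeps growing (it is
    # roughly n), yet the contaminated fraction ~1/n vanishes, so the bulk of
    # the samples looks more and more like a standard Gaussian.
    print(n, x.mean(), x.var(), np.mean(x > 5))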

Ensuring some higher convergence

Weird convergence behavior under the weak topology always has the same structure as the example I’m presenting here. All statistics that grow faster than some critical statistic (in this case, everything that grows faster than |x|) see their expected value diverge, while all statistics growing slower than it converge to the correct value. Thus, we can strengthen the weak convergence of a given sequence by identifying the “critical statistic” of that sequence, or at least by finding statistics which grow slower than the critical one.

Let’s state this more precisely. If X_n \rightarrow X weakly, and we know that the expected value of some statistic is bounded along the sequence, E( f(X_n) ) \leq K, then the expected value of any statistic g which grows strictly slower than f, meaning \frac{g(x)}{f(x)} \rightarrow 0 as |x| \rightarrow \infty, converges to the correct value:

\displaystyle E( g(X_n) ) \rightarrow E( g(X) )

Why is that the case ? Weird behavior under the weak topology is all about the fact that the weak topology poorly controls the tails of the X_n. The condition E( f(X_n) ) \leq K tells us that whatever is happening in the tails carries mass O(f^{-1}), so that we are safe for all functions which grow strictly more slowly.
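
Concretely, assuming f is positive and increasing in |x|, this is just Markov’s inequality:

\displaystyle P( |X_n| > a ) \leq \frac{E( f(X_n) )}{f(a)} \leq \frac{K}{f(a)}

so the tail mass beyond a is O( 1/f(a) ), and any g that grows strictly slower than f picks up a vanishing contribution from out there: this is the uniform integrability argument hiding behind the claim.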

Ensuring MGF convergence

The strongest topology I have talked about so far is the MGF topology. MGF convergence implies weak convergence, convergence of all moments, and convergence of all statistics that grow slower than some exponential.

With the above condition, we can ensure MGF convergence (for |t| < r) from weak convergence AND boundedness (or convergence) of E( \exp( r |x| ) ) along the sequence. We thus have a way to go from our weakest convergence to our strongest convergence in an instant.
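
To see why this falls under the previous condition, note that for any |t| < r,

\displaystyle \frac{\exp(t x)}{\exp(r |x|)} \leq \exp( -(r - |t|) |x| ) \rightarrow 0 \text{ as } |x| \rightarrow \infty

so every statistic \exp(t x) with |t| < r grows strictly slower than \exp(r |x|), and its expected value (i.e. the MGF at t) converges to the correct limit.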

Can opposite violations happen ?

We know that, under the weak topology, the expected value of a statistic can blow up along the sequence even though it is finite under the limit. Can violations also happen the opposite way, i.e. an f such that E( f(X) ) = \infty but E( f(X_n) ) remains bounded ?

Thankfully, this can’t happen (for continuous f at least). First, remark that we can restrict ourselves to positive f(x), because for f that aren’t absolutely convergent I don’t know how to define E( f(X) ) in the first place. Showing that E( f(X_n) ) eventually exceeds any bound we set is straightforward: choose B, and find a compact domain I such that E( f(X) 1(X \in I) ) > 2B. Then f(x) 1(x \in I) is a bounded statistic (smooth it at the boundary of I if you want it continuous), so by weak convergence of X_n its expected value converges, and E( f(X_n) ) > B for n large enough.

Complete agreement

Let X_n be some weakly convergent sequence, with limit X. We saw that the only weird thing that can happen is that some statistics which have a finite expected value under X have a limiting expected value along the sequence that doesn’t match (quite possibly infinite). We can then define the absolute strongest convergence, which would be that for any statistic with a finite expected value under X, the expected values along the sequence converge to the correct value. In that case, for ALL statistics, the expected value under X matches the limit of the expected values.

In practice, this complete agreement topology is all about tail behavior so it can probably be formalized in that way. I conjecture that something like:

\displaystyle \frac{p(x)}{p_n(x)} \leq K_1 \text{ and }\frac{p_n(x)}{p(x)} \leq K_2

and weak convergence would ensure “complete agreement” convergence. If you’re not convinced that my condition works, consider it for some examples like p(x) \propto x^{-3} and q(x) \propto \exp( - x). I’ll try to prove this in a further post.

Summary and conclusion

Today, I presented ways to strengthen the weak topology. We found that weak convergence plus boundedness of the expected value of one statistic ensures the correct convergence of all statistics that grow strictly slower. In particular, this enabled us to give a very simple condition ensuring MGF convergence from weak convergence. From this, we also saw that we could define a “complete agreement” topology, which asks that the expected value of every statistic is matched, and conjectured that this simply requires weak convergence plus similar tail behavior.

A technical note

Throughout this post, I have avoided referring to equivalence classes of statistics because I was afraid that this might be confusing for those who are encountering the concept for the first time. For readers who understand (or think they understand) concepts like small-o, big-O and big-theta, try to rephrase what I said earlier in those terms.


As always, feel free to correct any inaccuracies, errors or spelling mistakes, and to send me comments by email ! I’ll be glad to hear from you.

Struggles with probability topology: the Moment-Generating-Function topology

I have been struggling for a few months with topologies on probability distributions. I started when I realized that the common topologies (weak topology and total-variation topology) are extremely weak: they do not imply convergence of any moments. I then set out to understand this weakness better, and find better topologies.

Today, I will discuss the MGF (moment generating function) topology, and see whether it’s satisfying.

The MGF topology

Let’s first introduce how this topology works. Let’s first recall exactly what the MGF is. For a random variable X, the MGF is the function:

\displaystyle t \rightarrow E( \exp(t X) )

which, if you know your analysis, you will recognize as a Laplace transform. We’ll write M_X(t) for the MGF.

The MGF is convex. Actually, every even-order derivative of the MGF is positive, so I like to call it “hyper-convex”. And further, even the log of the MGF (which is the Cumulant Generating Function, CGF) is convex. To sum up, the MGF is extremely regular.
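
To see where the “hyper-convexity” comes from, differentiate under the integral sign (legitimate where the MGF is finite on a neighborhood of t):

\displaystyle M^{(k)}(t) = E( X^k \exp(t X) )

and for even k the integrand is positive, so all even-order derivatives are positive.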

The MGF topology is simple to define. A sequence of random variables X_n is said to converge iff there exists a compact neighborhood of 0 on which the sequence of MGFs converges pointwise to a limit MGF. In math:

\displaystyle \forall |t|\leq r, M_{X_n}(t) \rightarrow M_X(t)

Properties

MGF convergence is stronger than weak convergence (though, so far, I haven’t found any proof of that fact). Does it imply some convergence for the moments ? It does !!

MGF convergence implies convergence of all moments. In fact, it implies convergence of the expected values of all statistics which grow slower than \exp( r |x| ), where r is the radius of the convergence region.

Finally, we have a convergence that I feel justified in calling “strong”: it implies convergence of a large ensemble of key properties of random variables. However, even in this case, some funky things can happen. Indeed, MGF convergence doesn’t imply convergence of all statistics: it only implies convergence of all statistics that grow slower than a given exponential function. It’s easy to build counter-examples where there is MGF convergence with some statistics failing to have the appropriate limits.
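
Here is one such construction (mine, not necessarily the simplest): contaminate a standard Gaussian with a point mass at n carrying weight \exp(-2n). The MGFs converge pointwise on [-1,1], but the statistic \exp(2x), which grows faster than \exp(|x|), converges to the wrong value.

import numpy as np

# Counter-example sketch: X_n is N(0,1) with probability 1 - exp(-2n), and the
# constant n with probability exp(-2n). The MGF converges pointwise on [-1, 1]
# to the standard Gaussian MGF, but E(exp(2 X_n)) -> exp(2) + 1 instead of exp(2).
def mgf(n, t):
    eps = np.exp(-2.0 * n)
    return (1 - eps) * np.exp(t**2 / 2) + eps * np.exp(t * n)

def exp2_statistic(n):
    eps = np.exp(-2.0 * n)
    return (1 - eps) * np.exp(2.0) + eps * np.exp(2.0 * n)   # E(exp(2 X_n)), exactly

for n in [5, 10, 20, 40]:
    print(n, mgf(n, 1.0), exp2_statistic(n))
# mgf(n, 1.0) -> exp(1/2) ~ 1.649, while exp2_statistic(n) -> exp(2) + 1 ~ 8.389, not exp(2) ~ 7.389.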

Also, note an interesting fact: two probability distributions can have widely different density functions but almost identical cumulant generating functions (see this reference). McCullagh finds that very surprising, but it isn’t: the MGF is sensitive to the moments, not directly to the density, and almost identical moments can be obtained from very different densities.

Conclusion

The MGF topology isn’t perfect by any means, but it is considerably stronger than the weak topology in that it implies convergence of all moments. It seems like a good tool to have in your arsenal, especially since people rarely talk about it.


As always, feel free to correct any inaccuracies, errors or spelling mistakes, and to send me comments by email ! I’ll be glad to hear from you.

Struggles with probability topology: the Kullback-Leibler topology

In an earlier blog post, I stated my intense dislike for the common topologies on probability distributions (“common topologies” being the weak and the total-variation topologies). In this post, I present some of the properties of the Kullback-Leibler topology, to see whether it satisfies me.

The problems with the weak and TV topologies

Let’s first quickly summarize my issues with the weak and TV topologies. The essence of my objection is that both of them are too weak: sequences that I feel shouldn’t converge are found to be convergent. The problem is basically that they aren’t sensitive enough to what is going on in the tails of probability distributions. You can work around this weakness of the topologies (I’ll write a further post on this), but I’d rather work directly with a good topology.

The KL topology

The KL topology is simply the topology associated with the pre-metric:

\displaystyle KL(p,q) = \int p(x) \log( \frac{p}{q}) \geq 0

This is only a pre-metric because it isn’t symmetric and doesn’t respect the triangle inequality.

The KL divergence is important for several reasons. First of all, it is key in all likelihood-based approaches (and, in particular, in Bayesian methods). If our sampling model (the probability distribution of the data) is p_0(x) and our generative model has conditional distributions p(x|\theta), then the expected log-likelihood for a given value of \theta is minus the KL divergence (offset by a constant):

\displaystyle E( \log p(x|\theta) | p_0 ) = - KL( p_0(x), p(x|\theta) ) + C
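
The one-line derivation, for the skeptical:

\displaystyle E( \log p(x|\theta) | p_0 ) = \int p_0(x) \log p(x|\theta) dx = \int p_0(x) \log p_0(x) dx - KL( p_0(x), p(x|\theta) )

so the constant C is just the (negative) entropy of p_0, and maximizing the expected log-likelihood over \theta is the same as minimizing KL( p_0, p(\cdot|\theta) ).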

Second key property: the KL divergence is invariant under invertible changes of variable. This is an important property for a divergence on probability distributions: the choice of one “reference frame” or another shouldn’t change how different two random variables are.

Third key property: the KL divergence can only decrease under non-invertible transformations (this is the data-processing inequality).

Last but not least, KL is stronger than TV (this is Pinsker’s inequality):

\displaystyle KL(p,q) \geq 2 D_{TV}^2(p,q)

This last result is the reason why I believed that KL might solve my topologies issues.

A small computational note: if you are facing actual data and you want to compare two data-sets X_i and Y_i, please do not try to compute KL(X,Y) in some weird way or another. If you have to fit a probability distribution to your data before you can compute a divergence, then that’s a bad divergence and you shouldn’t use it. Same thing for the mutual information: don’t use it. Use divergences that can be computed from only a few low-order moments.

How strong is it ?

KL is at least as strong as TV, but is it strictly stronger ? It depends on which direction of KL we use ! There are actually two KL topologies, because the divergence isn’t symmetric, and these two topologies are slightly different.

Let p_n(x) be a sequence of probability distributions. If there exists a limit p(x) such that:

\displaystyle KL(p,p_n) \rightarrow 0

then p_n \rightarrow p in TV … and as far as I can tell, that’s it ! We don’t get much from using KL in that direction. If you want an example of this behavior, let p_n be the density of a mixture: with probability \frac{n-1}{n}, pick the value from a Gaussian centered at 0 with variance 1; with probability n^{-1}, pick it from any other probability distribution you like.
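
To check that this example really does converge in that direction of KL: since p_n \geq \frac{n-1}{n} p pointwise,

\displaystyle KL(p, p_n) = \int p(x) \log \frac{p(x)}{p_n(x)} dx \leq \log \frac{n}{n-1} \rightarrow 0

whatever the contaminating distribution is, even if it puts its mass arbitrarily far away.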

However, if we have convergence in the other KL topology:

\displaystyle KL(p_n,p) \rightarrow 0

Then we do have something stronger than TV. How much stronger depends on the tail behavior of the limit p(x): for all statistics that grow slower at infinity than -\log p(x), the expected values converge to the limit expected value. For example, if the limit is Gaussian, then -\log p(x) is a second-degree polynomial, so the mean and variance under p_n converge to the mean and variance of the limit. However, we can’t get much more than that, as shown by the following mixture: with probability \frac{n-1}{n}, pick the value from a Gaussian centered at 0 with variance 1; with probability n^{-1}, pick it from a Gaussian centered at n^{1/3} instead. The KL limit of this sequence is the Gaussian centered at 0, but the third moment of p_n converges to 1 rather than to 0, and the fourth moment diverges, even though the limit Gaussian has all of its moments finite.
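
A quick numerical sketch of what happens to the moments of this mixture (computed in closed form from the raw moments of the two Gaussian components):

import numpy as np

# Raw moments of p_n: N(0,1) with probability (n-1)/n, N(n^{1/3}, 1) with probability 1/n.
# For a Gaussian with mean m and variance 1: E X = m, E X^2 = m^2 + 1,
# E X^3 = m^3 + 3 m, E X^4 = m^4 + 6 m^2 + 3.
def raw_moments(m):
    return np.array([m, m**2 + 1, m**3 + 3 * m, m**4 + 6 * m**2 + 3])

for n in [10, 10**3, 10**6, 10**9]:
    mom = (n - 1) / n * raw_moments(0.0) + 1.0 / n * raw_moments(n ** (1 / 3))
    print(n, mom)
# The mean goes to 0 and the second moment to 1 (matching the limit N(0,1)),
# but the third moment goes to 1 instead of 0, and the fourth moment diverges.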

Yet another disappointing divergence

In the end, the two KL topologies both disappoint me. KL(p,p_n) seems to be only marginally stronger than TV, while KL(p_n,p) is quite a bit stronger: it implies convergence of some moments, but it remains too weak for my taste. I’d really like something that implies convergence of all moments, though it now seems like that might not exist. However, I still have some hope: I’m currently thinking about the cumulant generating function, and its cousin the moment generating function. Pointwise convergence of those seems to be a very strong requirement that should imply convergence of all moments. I haven’t found much on that yet, but once I do, I’ll add yet another post.


As always, feel free to correct any inaccuracies, errors or spelling mistakes, and to send me comments by email ! I’ll be glad to hear from you.