Struggles with probability topology: the Kullback-Leibler topology

In an earlier blog post, I have stated my intense dislike for the common topologies on probability distributions (“common topologies” being the weak and the total-variation topology). In this post, I present some of the properties of the Kullback-Leibler topology to try to see if I’m satisfied by it.

The problems with the weak and TV topologies

Let’s first quickly summarize my issues with the weak and TV topologies. The essence of my objection is that I find that both of those are too weak: sequences that I feel shouldn’t converge are found to be convergent. The problem is basically that they aren’t sensitive enough to what is going on in the tails of probability distributions. You can workaround this weakness of the topologies (I’ll make a further post on this), but I’d rather be working directly on a good topology.

The KL topology

The KL topology is simply the topology associated with the pre-metric:

\displaystyle KL(p,q) = \int p(x) \log( \frac{p}{q}) \geq 0

This is only a pre-metric because it isn’t symetric and doesn’t respect the triangle inequality.

The KL divergence is important for several reasons. First of all, it is key in all likelihood based approaches (and, in particular, in Bayesian methods). If our sampling model (the probability distribution of the data) is p_0(x) and our generative model has conditional distributions p(x|\theta), then the expected log-likelihood for a given value of \theta is the KL divergence (offset by a constant):

\displaystyle E( \log p(x|\theta) | p_0(x) ) = KL( p_0(x), p(x|\theta) ) - C

Second key property, the KL divergence is invariant for invertible changes of variable. This is an important property for divergence on probability distributions: the choice of one “reference frame” or another shouldn’t impact how much two random variables differ.

Third key property, the KL divergence is always reduced by non-invertible transformations.

Last but not least, KL is stronger than TV:

\displaystyle KL(p,q) \geq 2 D_{TV}^2(p,q)

This last result is the reason why I believed that KL might solve my topologies issues.

A small computational note: if you are facing actual data and you want to compare two data-sets X_i and Y_i, please do not try to compute KL(X,Y) in some weird way or another. If you have to fit a probability distribution to your data before you can compute a divergence, then that’s a bad divergence and you shouldn’t use it. Same thing for the mutual information: don’t use it. Use divergences that can be computed from only a few low-order moments.

How strong is it ?

KL is at least as strong as TV, but is it really strictly stronger ? It actually depends on which direction of KL we use ! There are actually two KL topologies, because it isn’t a symmetric divergence. These two topologies are slightly different.

Let p_n(x) be a sequence of probability distributions. If there exists a limit p(x) such that:

\displaystyle KL(p,p_n) \rightarrow 0

then $p_n \rightarrow p$ in TV … and as far as I can tell, that’s it ! We don’t get much from using KL in that direction. If you want an example of this behavior, let p_n be the density of a mixture of two Gaussians: with probability \frac{n-1}{n}, pick the value from a Gaussian centered at 0 with variance 1, with probability n^{-1} pick it from any other probability distribution.

However, if we have convergence in the other KL topology:

\displaystyle KL(p_n,p) \rightarrow 0

Then we do have something stronger than TV. This depends on the tail behavior of the limit p(x). For all statistics that grow slower at infinity than \log p(x), the expected values of those statistics converge to the limit expected value. For example, if the limit is Gaussian, then \log p(x) is a second degree polynomial, so that the mean and variance under p_n converge to the mean and variance of the limit. However, we can’t have more than that, as shown by the following example of a mixture distribution: with probability \frac{n-1}{n}, pick the value from a Gaussian centered at 0 with variance 1, with probability n^{-1} pick it from a Gaussian centered at n^{1/3} instead. The KL limit of this series is a Gaussian centered at 0, but the limit of the third moments is infinity, even though the limit has finite third moment.

Yet another disappointing divergence

In the end, the two KL topologies both disappoint me. KL(p,p_n) seems to be only marginally stronger than TV, while KL(p_n,p) is quite stronger: it implies convergence of some moments, but remains too weak for my taste. I’d really like something that implies convergence of all moments, though it now seems like that might not exist. However, I still have some hope: I’m currently thinking about the cumulant generating function, and it’s cousin the moment generating function. Pointwise convergence of those seems to be a very strong requirement that should imply convergence of all moments. However, I haven’t found much on that, but once I do, I’ll add yet another post.

As always, feel free to correct any inaccuracies, errors, spelling mistakes and to send comments on my email ! I’ll be glad to hear from you.