I don’t like the conventional topologies on probability distributions

Topologies are great. They are the tool that enables you to say that a sequence of objects converges to another object in the limit. However, I really don’t like the conventional topologies that people use on probability distributions, because I don’t think they respect fundamental intuitions of what it should mean to converge.

Let’s start by presenting the two villains: the weak topology, and the total-variation topology.

The weak topology

The weak topology is defined in the following way. A sequence of random variables (rv) X_n converges to some limit rv X if, for every bounded continuous function f, the expected value of f(X_n) converges to the expected value of f(X):

\displaystyle \forall f \text{ bounded and continuous}: E[ f(X_n) ] \rightarrow E[ f(X) ]

This is the most common way of defining that some sequence of rv converges; it is also known as convergence in distribution. If you have ever taken a course or two of probability theory, you have used this before (along with its stronger relatives: convergence “in probability”, “almost sure” and “sure” convergence).

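In order to prove weak convergence, it is actually enough to check the convergence for these functions: f_{\theta}(x) = \exp(i \theta x), the Fourier basis of \mathbb R, for every real \theta. In other words, it is enough that the characteristic functions converge pointwise (this is Lévy’s continuity theorem), so it’s not as hard as it might look from the definition.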
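To make the characteristic-function criterion concrete, here is a minimal sketch. The sequence X_n ~ N(1/n, 1) is an assumption chosen purely for illustration (it does not come from this post), as are the helper names `normal_cf` and `cf_gap`. It relies on the standard formula \exp(i \mu \theta - \sigma^2 \theta^2 / 2) for the characteristic function of a Gaussian, and checks that the gap to the characteristic function of N(0, 1) shrinks on a grid of frequencies.

```python
import cmath

def normal_cf(theta, mean=0.0, std=1.0):
    """Characteristic function E[exp(i * theta * X)] of a Gaussian N(mean, std^2)."""
    return cmath.exp(1j * mean * theta - 0.5 * (std * theta) ** 2)

def cf_gap(n, thetas):
    """Largest gap over `thetas` between the cf of N(1/n, 1) and the cf of N(0, 1)."""
    return max(abs(normal_cf(t, mean=1.0 / n) - normal_cf(t)) for t in thetas)

# Illustrative grid of frequencies from -5 to 5 (an arbitrary choice).
thetas = [k / 10.0 for k in range(-50, 51)]
gaps = {n: cf_gap(n, thetas) for n in (1, 10, 100, 1000)}
for n, gap in gaps.items():
    print(n, gap)
```

A finite grid only suggests pointwise convergence, of course; the actual proof has to handle every \theta.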

The total-variation distance, and its topology

The total-variation topology is basically an all-around improvement of the weak topology.

First of all, let’s define the total-variation distance. I prefer to define it as a distance between probability density functions (pdf) p(x) and q(x):

\displaystyle D_{TV}(p,q) = \max_A \left[ p(x \in A) - q(x \in A) \right]

In words: we find the event for which the probability is maximally different under p and q. The difference of probability is the total-variation distance.
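As a sanity check on this definition, here is a minimal numerical sketch. It assumes both distributions have densities, in which case the maximizing event is the set where p(x) > q(x), so that D_{TV} is the integral of \max(p - q, 0). The helper names, the integration range and the midpoint rule are illustrative choices of mine.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def tv_distance(p, q, lo=-20.0, hi=20.0, steps=40_000):
    """Approximate D_TV(p, q) as the integral of max(p - q, 0) (midpoint rule)."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += max(p(x) - q(x), 0.0) * dx
    return total

# Identical densities are at distance 0.
tv_same = tv_distance(normal_pdf, normal_pdf)
# Two unit-variance Gaussians whose means are 1 apart.
tv_shift = tv_distance(normal_pdf, lambda x: normal_pdf(x, mean=1.0))
print(tv_same, tv_shift)
```

For two unit-variance Gaussians whose means are d apart, the result can be compared with the closed form 2 \Phi(d/2) - 1, where \Phi is the standard normal cdf.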

The total-variation distance is a metric on the space of probability distributions, and so you can use it to define a topology. X_n converges to X if the probability densities p_n converge to p according to the total-variation distance:

\displaystyle D_{TV} (p_n,p) \rightarrow 0

There is a very strong link between D_{TV} and the weak convergence. Indeed, we can compute the distance, up to a factor of one half, by computing a max over all functions which are bounded by 1:

\displaystyle D_{TV}(p,q) = \frac{1}{2} \max_{||f||_{\infty} \leq 1} \left[ E_p(f) - E_q(f) \right]

So convergence in total variation is equivalent to the convergence of the expected values E[ f(X_n) ], uniformly over all functions f bounded by 1, which is why I said that TV improves on the weak convergence.

Why they’re weird

Let’s recapitulate. Both the weak convergence and the TV convergence guarantee that the expected values of bounded functions converge (bounded continuous functions, in the case of the weak convergence), with slight differences between the two: TV is a uniform convergence, whereas the weak convergence is more “pointwise”. So why do I think that they’re weird notions of convergence?

My problem is that I’m also very interested in the convergence of some unbounded functions. For example, the mean, the variance, the skewness, the kurtosis, etc. I really dislike that neither the weak convergence nor the total variation requires the convergence of all those important statistics.

A paradoxical example

Let me give you an example of the pathological behavior of these two topologies.

Let’s define a sequence of rv X_n which have a mixture distribution. With probability \frac{n-1}{n}, X_n is picked according to a Gaussian distribution centered at 0. With probability \frac{1}{n}, we pick it instead from a Gaussian centered at n. Both Gaussians have variance 1.

All X_n have mean 1, and their variance, which is equal to n, grows to infinity when n \rightarrow \infty. It seems unreasonable to believe that this sequence of rv converges to a well-behaved distribution, or to anything at all really.
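The moments of the mixture are quick to verify. The sketch below (with a hypothetical helper `mixture_moments`) uses the fact that the second moment of a Gaussian component is its variance plus its squared mean, then averages over the two components.

```python
def mixture_moments(n, var=1.0):
    """Mean and variance of X_n: N(0, var) w.p. (n-1)/n, N(n, var) w.p. 1/n."""
    weights = [(n - 1) / n, 1 / n]
    means = [0.0, float(n)]
    mean = sum(w * m for w, m in zip(weights, means))
    # E[X^2] of each Gaussian component is var + mean^2.
    second_moment = sum(w * (var + m * m) for w, m in zip(weights, means))
    return mean, second_moment - mean * mean

for n in (2, 10, 100, 1000):
    print(n, mixture_moments(n))
```

The mean stays at 1 while the variance comes out equal to n, up to floating-point rounding.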

But, according to both the weak and the TV topologies, X_n converges to a Gaussian centered at 0, with variance 1. I’ll leave the proof as an exercise to the interested reader. Start from the \max_f definition and it’s straightforward to find that D_{TV} \leq \frac{2}{n}.
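For readers who would rather check numerically, here is a minimal sketch. It assumes the closed form obtained by noticing that p_n - p is \frac{1}{n} times the difference between the two Gaussian densities, so that D_{TV}(p_n, p) = \frac{1}{n} (2 \Phi(n/2) - 1), and it compares that value with a brute-force integral of \max(p_n - p, 0). The function names and grid sizes are illustrative choices.

```python
import math

def normal_pdf(x, mean=0.0):
    """Density of a unit-variance Gaussian centered at `mean`."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def tv_closed_form(n):
    """(1/n) * (2 * Phi(n / 2) - 1), written with the error function."""
    return math.erf(n / (2.0 * math.sqrt(2.0))) / n

def tv_numeric(n, steps=100_000):
    """Midpoint-rule integral of max(p_n - p, 0) over [-10, n + 10]."""
    lo, hi = -10.0, n + 10.0
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        p_n = (1.0 - 1.0 / n) * normal_pdf(x) + (1.0 / n) * normal_pdf(x, mean=n)
        total += max(p_n - normal_pdf(x), 0.0) * dx
    return total

for n in (1, 10, 100):
    print(n, tv_closed_form(n), tv_numeric(n), 2.0 / n)
```

Both computations should agree and stay below \frac{1}{n}, comfortably within the \frac{2}{n} bound.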

It absolutely blew my mind when I realized that the TV topology predicts that the sequence X_n converges towards a Gaussian at 0. To me, this means that these topologies are myopic: they do not look at the full picture. If X_n has very rare but extremely large events, they get ignored by the TV distance.

What’s the solution?

To be honest, I don’t really know. I was hoping for a second that the Wasserstein distances W_k(p,q) might prove better, but convergence in W_k is equivalent to weak convergence plus convergence of the k^{th} moment, so they still seem a little bit too weak for me.
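To see what W_1 makes of the mixture example, here is a minimal sketch. It assumes the standard one-dimensional identity that W_1 is the integral of the absolute difference between the two cdfs; the helper names and the grid are illustrative choices.

```python
import math

def normal_cdf(x, mean=0.0):
    """Cdf of a unit-variance Gaussian centered at `mean`."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def w1_to_limit(n, steps=100_000):
    """Approximate W_1 between X_n and N(0, 1) as the integral of |F_n - F|."""
    lo, hi = -10.0, n + 10.0
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        cdf_n = (1.0 - 1.0 / n) * normal_cdf(x) + (1.0 / n) * normal_cdf(x, mean=n)
        total += abs(cdf_n - normal_cdf(x)) * dx
    return total

for n in (2, 10, 100):
    print(n, w1_to_limit(n))
```

The value should stay near 1 for every n, since a mass of \frac{1}{n} has to travel a distance of about n. So W_1 does notice the escaping mass in this example, which is consistent with the mean of X_n being 1 rather than 0.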

There isn’t much left after that. The Kullback-Leibler divergence KL(p,q) is one avenue I’m looking at right now. I’ll make another blog post on it once I understand better what convergence in KL implies, but while it seemed good at first, I’m not so sure now.

Is it really that bad?

To be honest again, the weak and TV topologies are not that bad. From a statistical point of view, you can still compute a lot of relevant quantities from the limit rv. For example, you can construct asymptotic confidence intervals in the following way.

Let’s build a confidence interval on X_n. Assume that you want a 90% confidence interval: an interval which contains the random variable 90% of the time. First, we select a 95% confidence interval on X, which we’ll call I_{95}. For any n such that D_{TV} (p_n,p) \leq 0.05, the probability of X_n \in I_{95} is at least 90%, by a simple application of the definition of the TV distance: the probabilities of any event under p_n and under p differ by at most D_{TV}. I_{95} is thus an (asymptotic) confidence interval.
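Here is a minimal sketch of that argument on the mixture X_n from the paradoxical example, whose limit is a Gaussian centered at 0 with variance 1. It assumes I_{95} = [-1.96, 1.96], the usual approximate 95% interval of that Gaussian; the `coverage` helper is a name I made up for illustration.

```python
import math

def normal_cdf(x, mean=0.0):
    """Cdf of a unit-variance Gaussian centered at `mean`."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def coverage(n, half_width=1.96):
    """P(X_n in [-half_width, half_width]) for the mixture X_n."""
    def mass(mean):
        # Probability that a Gaussian centered at `mean` falls in the interval.
        return normal_cdf(half_width, mean) - normal_cdf(-half_width, mean)
    return (1.0 - 1.0 / n) * mass(0.0) + (1.0 / n) * mass(float(n))

for n in (10, 20, 100, 1000):
    print(n, coverage(n))
```

For n \geq 20 the TV distance is at most 0.05, and the coverage should indeed come out above 90%; for n = 10 the guarantee does not apply yet and the coverage falls short.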

It’s thus quite possible that I’m over-reacting, but I think there are good reasons for seeking and working with stronger topologies than the conventional weak and TV topologies.

As always, feel free to correct any inaccuracies, errors or spelling mistakes, and to send me comments by email! I’ll be glad to hear from you.