What makes a mathematical concept “good”

Today, I’ll discuss a topic that I believe is very important both for teaching and explaining math, but also for understanding it: what makes some mathematical concepts “good”, or “better” than others. A bit abstract and dry, don’t you think? Here’s a probability example: what justifies the definitions of the variance, \text{var} = E[(x-\mu)^2], and the standard deviation, \sigma = \sqrt{\text{var}}? If a student asked you that question, would you be able to give them a good answer?

I believe that the quality of a mathematical concept needs to be measured along two different axes:

  • First of all, our mathematical concept needs to capture some intuitions that are relevant for the problem at hand.
  • Second, our concept needs to be mathematically convenient: it needs to just click in whatever mathematical setting we are working in.

The conjunction of those two is very important because it’s often possible to capture the same intuition with many different concepts, but most of them won’t be mathematically convenient. There is something I find really mysterious in math: for any given result, there usually aren’t many proofs that really “flow”.

Note that we sometimes want to capture the same intuition with several different concepts, because a given concept might only work well in a small set of circumstances. Outside of those, a competing concept may be a better fit.

Also note that the first criterion can be a bit blurry. As you learn higher and higher level math, you also learn higher and higher level concepts, and the intuitions they capture can themselves be mostly mathematical, which makes my two axes harder to distinguish. Maybe the best way to separate them is to think of the first axis as a long-term goal that we care about reaching, and of the second as the convenient properties that make that goal easy to manipulate.

This was way too dry. Let’s go back to examples.

First example: the variance

Let’s try to answer our hypothetical student’s question: why do we define the variance and the standard deviation the way we do? We’ll answer along our two axes.

First: what intuition do the variance and the standard deviation capture? This one is easy enough. They capture the spread of a random variable. It sounds almost tautological, but if a random variable has a bigger variance, it is more variable than another with a smaller variance. The standard deviation measures (very roughly) the width of the region where the random variable can be found around its mean.

At this point, our hypothetical student interjects: there are many other measures which would capture roughly “the width of where the random variable can be around its mean”. For example, the L1 deviation: \delta = E( |x-\mu| ). Why don’t we use this one instead?

The reason why \delta isn’t great is that it isn’t mathematically convenient. The key property that the variance offers is that it is additive over independent variables: the variance of a sum of independent random variables is the sum of their variances. Sums of random variables are ubiquitous, so this is a really important property. There are plenty of other ways in which the variance is a convenient concept, but I believe this to be the key one (though if you have more, please tell me). The standard deviation inherits the mathematical convenience of the variance, so it’s a better way of measuring the width of a random variable than \delta.
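Here is a quick numerical illustration of that key property (a minimal sketch in Python; the exponential and Gaussian distributions below are just arbitrary choices of mine): the variance of a sum of independent variables matches the sum of the variances, while the L1 deviation does not add up this way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000

# Two independent random variables (the distributions are arbitrary examples).
x = rng.exponential(scale=2.0, size=n_samples)
y = rng.normal(loc=1.0, scale=3.0, size=n_samples)
s = x + y

def l1_deviation(z):
    """The L1 analogue of the standard deviation: E|Z - mu|."""
    return np.mean(np.abs(z - z.mean()))

# Variance is additive over independent variables: both numbers are ~13.
print(np.var(x) + np.var(y), np.var(s))
# The L1 deviation is not additive: the two numbers clearly disagree.
print(l1_deviation(x) + l1_deviation(y), l1_deviation(s))
```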

We can now give a short answer to our student. We define variance this way because it captures something we care about: a measure of the spread of a random variable, and it has good mathematical properties, a key one being that it sums between independent variables.

A context in which variance isn’t the right concept

Variance is not always the best measure of the width of a random variable, though, which is why we don’t always want to work with it. For example, when we use Bayes’ formula, we are not working with sums of random variables but with products of density functions. The concept of variance still captures the right intuition but stops being so mathematically convenient. One concept that does work well for products of density functions is the Fisher information, but I won’t describe it here because that would take too long.

Why it matters

I believe it’s quite important to keep in mind what makes a good mathematical concept. First of all, when learning a new subject, it’s really important to learn its structure. If, for every mathematical concept you encounter, you can identify what intuitions it captures and the various ways and contexts in which it’s mathematically convenient, that structures your thoughts and helps you learn faster and better. Math is all about structure, and making sure your understanding of math is well-structured is very important.

The flip side is that it’s also very important, when teaching, to keep in mind why concepts are good. By structuring your presentation and communicating clearly what makes the concepts you use good, you help your audience understand the subject, and everybody benefits.

Finally, it’s very important to keep in mind what makes a good mathematical concept good when doing research. Sometimes the solution to a problem can’t be reached from the currently used concepts of a field. These problems require developing new concepts and having a good idea of what one should be looking for is quite important there.

Conclusion

A good concept captures something we care about while being convenient to work with. Keep this in mind when doing research, and when learning and teaching math: being explicit with structure can only help you.


As always, feel free to correct any inaccuracies, errors, or spelling mistakes, and to send comments by email! I’ll be glad to hear from you.

The Kullback-Leibler variance: a divergence measure for unnormalized probability distributions

Today, I’ll introduce a new divergence measure on probability distributions which I have found useful in some of my work. I call it the Kullback-Leibler variance (though if you have a better name in mind, or if it already has a name, please tell me), and it’s basically a slight variant of the conventional KL divergence.

The great thing about the KL variance (or KLV for short) is that it doesn’t require computing the normalization constants of the two probability distributions. Normalization constants are sometimes very hard to compute, but they are required for all the other divergence measures I know.

edit: I have found a horrible flaw in what I wrote here. I’ll leave this page up, but please note that I was wrong when I said that the KL variance provides an upper bound for the KL divergence. I’ll try to see if I can salvage the result in the future.

KLV: definition

The conventional KL divergence is:

\displaystyle KL(p,q) = E_p\left[ \log\left( \frac{p}{q} \right) \right]

And the KLV, by definition, is just the same expression but with a variance instead:

\displaystyle KL_V(p,q) = \text{var}_p (\log( \frac{p}{q} ) )

Just like the KL divergence, the KL variance is asymmetric. And just like the KL, it’s also useful to consider a symmetrized version of the KLV. We’ll see about that in a minute.
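To make the definitions concrete, here is a minimal Monte Carlo sketch (the two Gaussians are arbitrary examples of mine): both the KL and the KLV are just the mean and the variance of the same log-ratio, estimated from samples of p.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two example densities (arbitrary choices): p = N(0, 1), q = N(1, 2).
p, q = norm(loc=0.0, scale=1.0), norm(loc=1.0, scale=2.0)

x = p.rvs(size=1_000_000, random_state=rng)  # samples from p
log_ratio = p.logpdf(x) - q.logpdf(x)        # log(p/q), evaluated under p

kl = log_ratio.mean()                        # KL(p, q)   = E_p[ log(p/q) ]
klv = log_ratio.var()                        # KL_V(p, q) = var_p[ log(p/q) ]
print(kl, klv)

# The symmetrized versions simply add the corresponding term with p and q swapped.
```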

Key properties

Avoiding the normalization constant

In many cases, we only know the log-probability function \log( p(x) ) up to an additive constant, so the corresponding density does not integrate to 1: \int p(x) dx \neq 1. This is a real problem for the KL divergence: we need to compute both normalization constants before we can compute KL, otherwise it isn’t even guaranteed to be positive. (Note that there is a version of KL which applies to unnormalized distributions, but it’s annoying too: it tells us that p(x) and q(x)=2p(x) are different when they really describe the same distribution.)

On the contrary, the KLV doesn’t care about the normalization constants: they cancel out! Well, you might still need them to compute the expectation and variance under p exactly, but here are some ways around that (a small numerical illustration follows the list):

  • Sampling methods, among which Gibbs sampling
  • Moment inequalities (which is what I used in my own research): Brascamp-Lieb moment inequalities, any form of concentration inequalities, etc
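And here is the promised illustration, a sketch only: the Gaussians and the fake constants are arbitrary choices of mine, and I cheat by sampling directly from the normalized p, which is exactly where the sampling methods above would come in for a real unnormalized model. Shifting the log-densities by unknown constants changes the KL estimate but leaves the KLV estimate untouched.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, q = norm(0.0, 1.0), norm(1.0, 2.0)     # example densities (arbitrary choices)
x = p.rvs(size=1_000_000, random_state=rng)

# Pretend we only know the log-densities up to unknown additive constants.
log_Zp, log_Zq = 3.7, -1.2                # arbitrary, made-up constants
log_ratio_exact = p.logpdf(x) - q.logpdf(x)
log_ratio_unnorm = (p.logpdf(x) + log_Zp) - (q.logpdf(x) + log_Zq)

# The KL estimate is shifted by log_Zp - log_Zq ...
print(log_ratio_exact.mean(), log_ratio_unnorm.mean())
# ... but the KLV estimate is exactly the same: the constants cancel.
print(log_ratio_exact.var(), log_ratio_unnorm.var())
```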

Relationship to KL and total-variation distance

edit: this section is wrong. The link is not an upper bound but a rough approximation. I’ll try to see if I can fix this in the future

It’s good to have a computable quantity, but does it tell us anything useful? The symmetrized KLV actually upper bounds the symmetrized KL divergence, and through it upper bounds the total-variation distance. So convergence in symmetrized KLV implies KL convergence and total-variation convergence. Pretty cool, no?

Proof: Consider two density functions p and q. Define the tilted interpolating exponential family:

\displaystyle t(x;{\lambda}) = p^{(1-{\lambda})} q^{\lambda}

which goes from p to q as \lambda goes from 0 to 1.

Consider the integral of t against x:

\displaystyle L(\lambda) = \int t(x;\lambda) dx

L is easily found to be what I call a super-convex function: every even-order derivative is positive (which I’ll leave to you as an exercise). edit: this is wrong

L also has some interesting derivatives:

  • L'(0) = KL(p,q)
  • L'(1) = KL(q,p)
  • L''(0) = KL_V(p,q)
  • L''(1) = KL_V(q,p)

We then use the convexity of L''(x) by bounding it by its chord: L''(x) \leq (1-x) L''(0) + x L''(1) to end up with: edit: also wrong. The following is not an upper bound but a rough approximation

\displaystyle KL(p,q) + KL(q,p) \leq \frac{1}{2} [KL_V(p,q) + KL_V(q,p)]

The relation to the total variation comes from the Pinsker/Kullback inequality:

\displaystyle KL(p,q) \geq 2 [D_{TV} (p,q)]^2
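As a small sanity check of the Pinsker/Kullback inequality itself (and of nothing else: this does not touch the flawed bound above), here is a numerical sketch on two arbitrary Gaussians of my choosing, computing D_{TV} as \frac{1}{2}\int |p-q| and KL by numerical integration.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p, q = norm(0.0, 1.0), norm(1.0, 2.0)  # example densities (arbitrary choices)

# D_TV(p, q) = (1/2) * integral |p(x) - q(x)| dx  (event-based convention).
tv, _ = quad(lambda x: 0.5 * abs(p.pdf(x) - q.pdf(x)), -np.inf, np.inf)

# KL(p, q) = E_p[ log(p/q) ], also by numerical integration.
kl, _ = quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), -np.inf, np.inf)

print(kl, 2 * tv**2, kl >= 2 * tv**2)  # Pinsker: KL(p,q) >= 2 * D_TV(p,q)^2
```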

Conclusion

Now you know about the KLV! I hope you can apply it in your research in some way.

Do you know any other divergence measures which work well with unnormalized distributions?


As always, feel free to correct any inaccuracies, errors, or spelling mistakes, and to send comments by email! I’ll be glad to hear from you.

I don’t like the conventional topologies on probability distributions

Topologies are great. They are the tool that enables you to say that a sequence of objects converges, in the end, to another object. However, I really don’t like the conventional topologies which people use on probability distributions, because I don’t think they respect fundamental intuitions about what it should mean to converge.

Let’s start by presenting the two villains: the weak topology, and the total-variation topology.

The weak topology

The weak topology is defined in the following way. A sequence of random variables (rv) X_n converges to some limit rv X if the expected value of every bounded continuous function f converges:

\displaystyle \forall f: E[ f(X_n) ] \rightarrow E[ f(X) ]

This is the most common way of defining that a sequence of rv converges. If you have ever taken a course or two in probability theory, you have used this before (along with the related, stronger notions: convergence “in probability”, “almost sure” and “sure” convergence).

In order to prove weak convergence, we actually only need to prove convergence of the expected values of the functions f_{\theta}(x) = \exp(i \theta x), the Fourier basis of \mathbb R (in other words, convergence of the characteristic functions, by Lévy’s continuity theorem), so it’s not as hard as it might look from the definition.
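To make this concrete, here is a small sketch (the CLT setup is my own arbitrary example): standardized sums of uniform variables converge weakly to a standard Gaussian, and their empirical characteristic functions indeed approach \exp(-\theta^2/2) on a grid of \theta values.

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.linspace(-5.0, 5.0, 11)

def empirical_char_fn(samples, thetas):
    """Monte Carlo estimate of E[exp(i * theta * X)] on a grid of thetas."""
    return np.exp(1j * np.outer(thetas, samples)).mean(axis=1)

gaussian_char_fn = np.exp(-thetas**2 / 2)  # characteristic function of N(0, 1)

for n in (1, 10, 100):
    # X_n: standardized sum of n uniforms on [-1/2, 1/2] (a CLT example).
    u = rng.uniform(-0.5, 0.5, size=(100_000, n))
    x_n = u.sum(axis=1) / np.sqrt(n / 12.0)
    gap = np.abs(empirical_char_fn(x_n, thetas) - gaussian_char_fn).max()
    print(n, gap)  # the gap to the Gaussian characteristic function shrinks
```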

The total-variation distance, and its topology

The total-variation topology is basically an all-around improvement of the weak-topology.

First of all, let’s define the total-variation distance. I prefer to define it as a distance between probability density functions (pdf) p(x) and q(x):

\displaystyle D_{TV}(p,q) = \max_A \left[ \, p(x \in A) - q(x \in A) \, \right]

In words: we find the event for which the probability is maximally different under p and q. The difference of probability is the total-variation distance.

The total-variation distance is a metric on the space of probability distributions, so you can use it to define a topology. X_n converges to X if the probability densities p_n converge according to the total-variation distance:

\displaystyle D_{TV} (p_n,p) \rightarrow 0

There is a very strong link between D_{TV} and weak convergence. Indeed, we can compute the distance as a max over all functions bounded by 1 (the factor \frac{1}{2} is what makes this match the event-based definition above):

\displaystyle D_{TV}(p,q) = \frac{1}{2} \max_{||f||_{\infty} \leq 1} \left[ E_p(f) - E_q(f) \right]

So convergence in total variation is equivalent to uniform convergence of the expectations of all bounded functions, which is why I said that TV improves on weak convergence.
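A small numerical sketch of that equivalence, on two arbitrary Gaussians of my choosing: the event-based formula \frac{1}{2}\int|p-q| and the variational formula with the witness function f = \text{sign}(p-q) give the same number.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p, q = norm(0.0, 1.0), norm(1.0, 2.0)  # example densities (arbitrary choices)

# Event-based formulation: D_TV = (1/2) * integral |p - q|.
tv_density, _ = quad(lambda x: 0.5 * abs(p.pdf(x) - q.pdf(x)), -np.inf, np.inf)

# Variational formulation: f = sign(p - q) attains the max over ||f||_inf <= 1.
witness = lambda x: np.sign(p.pdf(x) - q.pdf(x))
tv_witness, _ = quad(
    lambda x: 0.5 * witness(x) * (p.pdf(x) - q.pdf(x)), -np.inf, np.inf
)

print(tv_density, tv_witness)  # the two numbers agree
```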

Why they’re weird

Let’s recapitulate. Both weak convergence and TV convergence guarantee that the expected values of bounded (continuous) functions converge, with slight differences between the two: TV gives a uniform convergence, whereas weak convergence is more “pointwise”. So why do I think that they’re weird notions of convergence?

My problem is that I’m also very interested in the convergence of some unbounded functions: for example, the mean, the variance, the skewness, the kurtosis, etc. I really dislike that neither weak convergence nor total-variation convergence requires the convergence of those important statistics.

A paradoxical example

Let me give you an example of the pathological behavior of these two topologies.

Let’s define a sequence of random variables X_n with mixture distributions. With probability \frac{n-1}{n}, X_n is picked according to a Gaussian distribution centered at 0. With probability \frac{1}{n}, we pick it instead from a Gaussian centered at n. Both Gaussians have variance 1.

All X_n have mean 1, and their variance, which equals n, grows to infinity as n \rightarrow \infty. It seems unreasonable to believe that this sequence of rv converges to a well-behaved distribution, or to anything at all really.

But, according to both the weak and the TV topologies, X_n converges to a Gaussian centered at 0, with variance 1. I’ll leave the proof as an exercise to the interested reader. Start from the \max_f definition and it’s straightforward to find that D_{TV}(p_n, p) \leq \frac{1}{n}.
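Here is a sketch that makes the pathology visible numerically (the sample sizes and the integration grid are arbitrary choices of mine): the empirical mean stays near 1, the empirical variance grows like n, and yet the TV distance to the standard Gaussian shrinks like 1/n.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
std_normal = norm(0.0, 1.0)

for n in (20, 200, 2000):
    # The mixture: N(0, 1) with probability (n-1)/n, N(n, 1) with probability 1/n.
    far = rng.random(1_000_000) < 1.0 / n
    x_n = np.where(far, rng.normal(n, 1.0, far.size), rng.normal(0.0, 1.0, far.size))

    # D_TV(p_n, N(0,1)) = (1/2) * integral |p_n - phi|, computed on a grid that
    # covers the two bumps (the region in between contributes essentially nothing).
    xs = np.concatenate([np.linspace(-8.0, 8.0, 4001),
                         np.linspace(n - 8.0, n + 8.0, 4001)])
    p_n = (1 - 1/n) * std_normal.pdf(xs) + (1/n) * norm(n, 1.0).pdf(xs)
    tv = trapezoid(0.5 * np.abs(p_n - std_normal.pdf(xs)), xs)

    print(n, x_n.mean(), x_n.var(), tv)  # mean ~ 1, variance ~ n, TV ~ 1/n
```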

It absolutely blew my mind when I realized that the TV topology predicts that the sequence X_n converges towards a Gaussian centered at 0. To me, this means these topologies are myopic: they do not look at the full picture. If X_n has very rare but extremely large events, those get ignored by the TV distance.

What’s the solution?

To be honest, I don’t really know. I was hoping for a second that the Wasserstein distances W_k(p,q) might prove better, but convergence in W_k is equivalent to weak convergence plus convergence of the first k moments, so they still seem a little bit too weak for me.

There isn’t much left after that. The Kullback-Leibler divergence KL(p,q) is one avenue I’m looking at right now. I’ll make another blog post on it once I understand better what convergence in KL implies, but while it seemed good at first, I’m not so sure now.

Is it really that bad?

To be honest again, the weak and TV topologies are not that bad. From a statistical point of view, you can still compute a lot of relevant quantities from the limit rv. For example, you can construct asymptotic confidence intervals in the following way.

Let’s build a confidence interval on X_n. Assume that you want a 90% confidence interval: an interval in which the random variable falls at least 90% of the time. First, we select a 95% confidence interval for X, which we’ll call I_{95}. For any n, as soon as D_{TV}(p_n,p) \leq 0.05, the probability of X_n \in I_{95} becomes bigger than 90%, by a simple application of the definition of the TV distance: P(X_n \in I_{95}) \geq P(X \in I_{95}) - D_{TV}(p_n,p) \geq 0.95 - 0.05 = 0.90. I_{95} is thus an (asymptotic) confidence interval.
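Here is that construction applied to the mixture example above (a sketch; for this toy case the coverage probability is available in closed form, which makes the check easy): once n is large enough that D_{TV}(p_n,p) \leq 0.05, the 95% interval of the limiting Gaussian covers X_n at least 90% of the time.

```python
from scipy.stats import norm

std_normal = norm(0.0, 1.0)
lo, hi = std_normal.ppf(0.025), std_normal.ppf(0.975)  # I_95 for the limit X

for n in (20, 200, 2000):
    # Exact P(X_n in I_95) for the mixture example; here D_TV(p_n, p) ~ 1/n <= 0.05.
    coverage = (1 - 1/n) * (std_normal.cdf(hi) - std_normal.cdf(lo)) \
               + (1/n) * (norm(n, 1.0).cdf(hi) - norm(n, 1.0).cdf(lo))
    print(n, coverage)  # always >= 0.90, as guaranteed by the TV argument
```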

It’s thus quite possible that I’m over-reacting, but I think there are good reasons for seeking and working with stronger topologies than the conventional weak and TV topologies.


As always, feel free to correct any inaccuracies, errors, or spelling mistakes, and to send comments by email! I’ll be glad to hear from you.