Today, I’ll introduce a new divergence measure on probability distributions which I have found useful in some of my work. I call it the Kullback-Leibler variance (though if you have a better name in mind, or if it already has a name, please tell me); it’s basically a slight variant of the conventional KL divergence.
The great thing about the KL variance (or KLV for short) is that it doesn’t require computing the normalization constants of the two probability distributions. Normalization constants are sometimes very hard to compute, but they are required by every other divergence measure I know of.
edit: I have found a horrible flaw in what I wrote here. I’ll leave this page up, but please note that I was wrong when I said that the KL variance provides an upper bound for the KL divergence. I’ll try to see if I can salvage the result in the future.
The conventional KL divergence is:

$$\mathrm{KL}(p, q) = \mathbb{E}_p\!\left[\log \frac{p(x)}{q(x)}\right]$$
And the KLV, by definition, is just the same expression but with a variance instead:

$$\mathrm{KLV}(p, q) = \mathrm{Var}_p\!\left[\log \frac{p(x)}{q(x)}\right]$$
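To make these quantities concrete, here is a minimal sketch in Python/NumPy (the two discrete distributions are toy examples of my own choosing): the KL is the expectation of the log-ratio under $p$, and the KLV is its variance under $p$.

```python
import numpy as np

# Two toy normalized discrete distributions on the same 3-point support.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

log_ratio = np.log(p / q)

# KL(p, q) = E_p[log(p/q)]
kl = np.sum(p * log_ratio)

# KLV(p, q) = Var_p[log(p/q)] = E_p[(log(p/q))^2] - (E_p[log(p/q)])^2
klv = np.sum(p * log_ratio**2) - kl**2

print(kl, klv)
```

Both quantities are non-negative here: the KL because $p$ and $q$ are normalized, the KLV because it is a variance.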
Just like the KL divergence, the KL variance is asymmetric. And just like the KL, it’s also useful to consider a symmetrized version of the KLV. We’ll see about that in a minute.
Avoiding the normalization constant
In many cases, we know the log-probability function only up to a constant: integrating the unnormalized density does not give 1, i.e. $\int \tilde{p}(x)\,dx = Z_p \neq 1$. This matters a lot for the KL divergence, because we need to compute both normalization constants before we can compute it; otherwise, the KL divergence won’t necessarily be positive. (Note that there is a version of the KL which applies to unnormalized distributions, but it’s annoying too: it can tell us that two unnormalized functions are different even when they represent the same distribution.)
On the contrary, the KLV doesn’t care about the normalization constants: they cancel out! Well, you might still need them to compute the expectations defining the variance, but here are some ways around that:
- Sampling methods, such as Gibbs sampling
- Moment inequalities (which is what I used in my own research): Brascamp-Lieb moment inequalities, any form of concentration inequality, etc.
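Here is a small Monte Carlo sketch (my own toy example) illustrating the cancellation: the missing normalization constants only shift $\log(p/q)$ by a constant, and a constant shift does not change a variance. For $p = N(0,1)$ and $q = N(1,1)$, $\log(p/q) = 1/2 - x$, so the exact KLV is $\mathrm{Var}_p(x) = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized log-densities: the -log(sqrt(2*pi)) constants are dropped on purpose.
def log_p_tilde(x):  # p = N(0, 1), up to a constant
    return -0.5 * x**2

def log_q_tilde(x):  # q = N(1, 1), up to the same constant
    return -0.5 * (x - 1.0)**2

# Monte Carlo estimate of KLV(p, q) = Var_p[log(p/q)].
# The dropped constants shift log(p/q) uniformly, leaving the variance unchanged.
xs = rng.standard_normal(200_000)  # samples from p
log_ratio = log_p_tilde(xs) - log_q_tilde(xs)
klv_estimate = np.var(log_ratio)

print(klv_estimate)  # should be close to the exact value, 1
```

(Of course, this example cheats a bit by sampling from $p$ directly; in the hard cases you would use one of the sampling methods or moment inequalities listed above.)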
Relationship to KL and total-variation distance
edit: this section is wrong. The link is not an upper bound but a rough approximation. I’ll try to see if I can fix this in the future.
It’s good to have a computable quantity, but does it tell us anything useful? The symmetrized KLV actually upper bounds the symmetrized KL divergence, and through it the total-variation distance. So convergence in symmetrized KLV implies convergence in KL and in total variation. Pretty cool, no?
Proof: Consider two density functions p and q. Define the tilted interpolating exponential family:

$$p_t(x) \propto p(x)^{1-t}\, q(x)^{t}$$

which goes from $p$ to $q$ as $t$ goes from 0 to 1.
Consider the log of the integral of the tilted density over $x$:

$$L(t) = \log \int p(x)^{1-t}\, q(x)^{t}\, dx$$
L is easily found to be what I call a super-convex function: every even derivative is positive (which I’ll leave to you as an exercise). edit: this is wrong
L also has some interesting derivatives:

$$L'(t) = \mathbb{E}_{p_t}\!\left[\log \frac{q(x)}{p(x)}\right], \qquad L''(t) = \mathrm{Var}_{p_t}\!\left[\log \frac{q(x)}{p(x)}\right]$$

In particular, $L(0) = L(1) = 0$, $L'(0) = -\mathrm{KL}(p, q)$, $L'(1) = \mathrm{KL}(q, p)$, $L''(0) = \mathrm{KLV}(p, q)$ and $L''(1) = \mathrm{KLV}(q, p)$.
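These identities are easy to check numerically. The sketch below (my own, on made-up discrete distributions) compares finite-difference derivatives of $L$ at $t = 0$ against $-\mathrm{KL}(p, q)$ and $\mathrm{KLV}(p, q)$:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

def L(t):
    # Log of the integral (here, sum) of the tilted density p^(1-t) * q^t.
    return np.log(np.sum(p**(1 - t) * q**t))

h = 1e-3
L1_at_0 = (L(h) - L(-h)) / (2 * h)            # central difference for L'(0)
L2_at_0 = (L(h) - 2 * L(0.0) + L(-h)) / h**2  # central difference for L''(0)

log_ratio = np.log(p / q)
kl = np.sum(p * log_ratio)              # KL(p, q)
klv = np.sum(p * log_ratio**2) - kl**2  # KLV(p, q)

# Expected: L(0) = L(1) = 0, L'(0) = -KL(p, q), L''(0) = KLV(p, q)
print(L1_at_0, -kl)
print(L2_at_0, klv)
```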
We then use the (claimed) convexity of $L''$ by bounding it by its chord, $L''(t) \le (1-t)\,L''(0) + t\,L''(1)$, to end up with:

$$\mathrm{KL}(p, q) + \mathrm{KL}(q, p) = \int_0^1 L''(t)\,dt \le \frac{\mathrm{KLV}(p, q) + \mathrm{KLV}(q, p)}{2}$$

edit: also wrong. This is not an upper bound but a rough approximation.
The relation to the total variation comes from the Pinsker/Kullback inequality:

$$\|p - q\|_{TV} \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p, q)}$$
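As a quick sanity check of the inequality (again on toy distributions of my own choosing):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

tv = 0.5 * np.sum(np.abs(p - q))  # total-variation distance
kl = np.sum(p * np.log(p / q))    # KL(p, q)

# Pinsker's inequality: TV <= sqrt(KL / 2)
print(tv, np.sqrt(kl / 2))
```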
Now you know about the KLV! I hope you can apply it in your research in some way.
Do you know any other divergence measures that work well with unnormalized distributions?
As always, feel free to correct any inaccuracies, errors, or spelling mistakes, and to send comments to my email! I’ll be glad to hear from you.