# Nips day 3

Here is what I learned during the third day of nips.

The day started by a very very cool talk by Kyle Cranmer from CERN. He did a great job explaining how physicists performed the statistical analysis behind the discovery of the Higgs boson. His talk went in great detail but remained extremely easy to understand and was absolutely great (probably my favorite of the whole conference). He then also had a few great proposals for interesting problems that the community might be interested in modeling.

In the afternoon, there was a very entertaining talk by Marc Raibert from the boston dynamics company. They do the wonderful robots that you probably have seen on youtube. I was a little bit sad that he didn’t give us too much insight into how they actually make the robots move but he did show a lot and had lots of entertaining videos so he gets a shout-out.

Another cool idea in the afternoon, that was a lot more focused on practical theory, was presented by Damien Scieur, Alexandre d’Aspremont and Francis Bach. They showed a post-processing method for optimizing systems. Their method is a very small and cheap trick that you can plugin to any optimization method to make it better. Overall, it was pretty cool, but I’m not sure if I understand optimization well enough to give the best commentary on that.

# Nips day 2

This is the second day of nips. Here is what I learned.

In the morning, Matthias Seeger (Amazon) presented the model that they use to forecast demand. That was very interesting. That was basically a linear regression with handcrafted features (he didn’t go in detail there about how they were crafted). The regression itself was also interesting: they want to regress the number of sales in the day to predictors; they did so by having two logistic regression for determining whether you would get 1 sale or more, 2 sales or more, and then a Poisson likelihood for all sales higher than that. This makes the system able to model the fact that the first few values are much higher than they should be in a Poisson model, while modeling the tail behavior in a simple way.
Another interesting bit: they have periods where products aren’t in stock, so they can’t be sold. If they ignore these periods, their model goes wrong (it under-estimates the desirability of the product). If they model these correctly: they are no sales because the product is out-of-stock, then their model is way better. As always, the generative model should match the physical realities of the data.

In the afternoon, there was also a great talk by Saket Navlakha from the Salk institute. I’m normally not a great fan of bio-mimetics approach (trying to use biology to improve machine learning techniques, or, more generally, engineering technology; in most cases, the constraints we deal with as engineers are extremely different from the constraints that biology is dealing with) but he had two great examples of exactly that. The first one consisted of creating a communication network by removing edges from a fully-connected (or just very dense) network.

Final great thing was a very particular mixture of Bayesian methods and deep learning by Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet and Ole Winther. They wanted to model sequential data (recordings of speech). A good idea for a model would use a latent markov-chain, conditional on which the sound at each time step is independent (a hidden-markov-chain model). Normally, people would use a very simple latent model: a Kalman filter. This doesn’t work well at all as it as too simple dynamics compared to the actual dynamics of the signal being modelled. What Marco and co-authors propose to use instetad is to use a deep-network to represent the dynamics of the latent space oO. They then are able to perform inference on the weights of their deep network !!?! and finally they learn to use an inference network to take a soundwave as an input and return a variational inference approximation of the posterior on the latent space !!!?!??!!?! This sounds insane, but it actually works and works very well.
I love these extremely creative combinations of deep-learning with bayesian inference: there was a lot of it at nips this year and I’m very excited to see where it goes in the future.

# Nips day 1

December is a great month. Like most people, I of course look forward to christmas to top it all, but there is another great time during december: the nips conference, which is held in Barcelona this year. Nips is a great conference on machine learning, and it is a great pleasure to be able to aItend it. I’ll take this great excuse to revive this blog and share some of the things I’m learning here.

Day 1 action report.

Day 1 was dedicated to tutorials: long talks with the objective of introducing the audience to important topics of machine learning.

I first attended a tutorial on variational inference by David Blei, Shakir Mohamed and Rajesh Ranganath. It was really cool. They presented variational inference (i.e: minimizing the reverse-KL divergence inside some restricted class of distributions to the target distribution) in a way that made understanding everything that they do very clear. They first presented an overview of the Bayesian approach, then showed how variational inference can be used on almost-all situations, and finally showed some cool applications mixing Bayesian inference and deep nets. I don’t think their slides are online yet, sadly 😦 However, David Blei and co-authors have a review on the subject which seems worth checking out: https://arxiv.org/abs/1601.00670

The second exciting thing I got to learn about concerned gradient methods. I know very little about those, so I went to listen to Francis Bach and Suvrit Sra to learn more. Their talks were focused on showing the prowess of SVRGD (“stochastic variance reduction gradient descent”, if I’m not mistaken) when compared to alternatives. The key idea of SVRGD is:

• we do stochastic gradient descent
• but, we keep the gradients which we have computed in memory and, even though they are “stale”: they do not correspond to the point at which we are at, we follow the sum of all gradients we have in memory

This approach, surprisingly, works super well and just wipes the floor with SGD and deterministic gradient methods. I’ll be sure to read the corresponding papers in detail (some samples: https://arxiv.org/abs/1202.6258 , https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf ). Slides are available at: http://www.di.ens.fr/~fbach/fbach_tutorial_vr_nips_2016.pdf and http://www.di.ens.fr/~fbach/ssra_tutorial_vr_nips_2016.pdf .

I’m sure the rest of the conference will be at least as good !

# Legislating on statistical inference

Attention conservation notice: car insurance companies in France are forbidden
from adapting their premiums to the sex of the insured, even though women are
less likely to have accidents than men. This leads me to write long confusing
paragraphs on statistics versus racism / prejudice versus legislation.

I was recently told by a good friend that car insurance companies in France
recently lost the ability to adapt their premiums to the sex of the insured.
Insurance companies would like to be able to do this because women are less
likely to have accidents and so they should pay less insurance. More generally,
insurance is basically a form of betting: they are basically offering you a bet
on whether or not you will run into trouble. However, this bet is setup to
minimize your risk: if you do run into trouble, they give you money/aid to make
sure that the trouble is reduced.

Like all people offering bets, insurance companies are thus in the business of
performing accurate inference on whether or not a given person will run into
trouble. For health insurance, they want to predict whether or not they will
become sick (and also exactly how sick they might become: a flu is less
expensive to treat than cancer or diabetes).

This is where a very ugly little fact of life and of statistics rears its head.
This is something that I’m still grappling with so excuse any naive thoughts I
might have on this issue. The problem is the following: how do you respond when
someone observes that, for example, black people in the USA commit a
disproportionate fraction of crime ? (see this link) Does this observation then justify having racist thoughts ? I’d
like to answer a strong: “NO” to that question but I really can’t find a way to
say. I can’t find an argument where I’m not arguing that skin color is somehow
a variable that you shouldn’t base your inference on because it is “morally
illegal”. This troubles me greatly.

If this is indeed the direction we want to follow: some features of the data we
collect are “morally illegal”, then can we maybe try to collect all such
features ? Obviously, sex is also off the table (your algorithm shouldn’t be
sexist). Similarly, sexual orientation is also off. Age has a weirder status:
you can obviously be biased against young people and/or old people (and a lot
of people seem to be extremely biased based on age, which I find very weird),
but it somehow feels not as wrong to stereotype people based on age. This seems
to round up all the clearly illegal features. (One “-ism” that is clearly
missing from my list is classism: the prejudice against people that aren’t from
the right social background, like working class people. This is because I
couldn’t find any clear feature for class.)

We then run into muddy territory: the features that by themselves should be
innocent, but can be extremely indicative, by themselves or in combination, of
an illegal feature. Consider postal adress, for example: it is extremely
indicative of social status and possibly of age/sexual orientation (think about
the Village in New York, le Maris in Paris, etc). Similarly, names are also a
big problem: they are extremely indicative of ancestry. So on and so forth, we
find that a lot of features that might be available to insurance agencies are
maybe indicative enough that they can get a good inference on these illegal
features we were discussing before.

I’m not sure how to solve this issue: forbidding by law insurance companies to
use sex as a predictor is certainly one way, but I’m not sure if it’s really
the right solution: it feels more like playing judicial whack-a-mole than a
constructive general solution. I’m not even sure if there really is a problem.
I’d be curious to hear your thoughts.

# Using natural exponential families: the Chernoff concentration result

In my last post, I introduced a neat little trick: how to define the natural exponential family of a probability distribution $p(x)$. Let’s now see how we can use this trick to prove a powerful concentration result known as the Chernoff (or Cramer-Chernoff) theorem.

### The intuition

You should already know the Chebyshev concentration result: a random variable who has finite variance has most of its probability around its mean. The basic idea behind it is that, by bounding some upper-moment of the random variable, we force most of its mass to be tightly packed.

Now, consider a distribution $p(x)$ which has a finite moment generating function $M(t) = E( \exp(tx) )$. This represents a constraint on a statistic with a very fast growth rate, much faster than $x^2$, the statistic of the variance. This constraint should give us much tighter bounds on the probability distribution.

### The math

Let’s fix $E(x) = \mu_0$, and let’s seek to bound $p(x \geq \mu)$ for some $\mu \geq \mu_0$. Since we want to use the cool exponential families, let’s just stick a $\exp(tx)$ term in there and see if we can work with that:

$\displaystyle \int p(x) 1(x \geq \mu) dx = \\ M(t) \exp( - t\mu) \int \frac{p(x)\exp(tx)}{M(t)} \exp(-t(x-\mu)) 1(x\geq\mu) dx$

Let’s take the terms in order. First the $M(t) \exp( - t\mu)$ term. First, let’s make the cumulant generating function $C(t) = \log(M(t))$ appear:

$\displaystyle M(t) \exp( - t\mu) = \exp( C(t) - t \mu )$

To the trained eye, this looks good. Do you remember the dual $D(\mu)$ of $C(t)$ ? He is defined by $D(\mu) = - \inf_t ( C(t) - t \mu)$. We thus have a cool lower bound for this first term:

$\displaystyle M(t) \exp( - t\mu) \leq \exp( - D(\mu) )$

Furthermore, this lower bound has a very nice link with the natural exponential family: it is reached for the value of $t$ such that the $p(x|t)$ member of the exponential family has mean $\mu$.

Let’s now work on the second term: $\int \frac{p(x)\exp(tx)}{M(t)} \exp(-t(x-\mu)) 1(x\geq\mu) dx$. We recognize an expected value under $p(x|t)$ of the statistic $\exp(-t(x-\mu)) 1(x\geq\mu)$. This statistic has values in $[0,1]$ so we can upper-bound it by $1$. Combining this with the earlier upper-bound, we get that:

$\displaystyle p(x \geq \mu) \leq \exp( - D(\mu) )$

This is a pretty nice bound: it tells us that most of the probability mass is in a very tight neighborhood around the mean. But let’s try to go further, and lower-bound $p(x \geq \mu)$.

### A lower-bound

The trick to deriving the lower-bound is to first fix the value $t$ so that $p(x|t)$ has mean $\mu$. Let’s note this value $t^\star$. We have:

$\displaystyle p(x \geq \mu) = \exp( -D(\mu) ) \int p(x|t^\star) \exp( - t^\star (x - \mu)) 1(x \geq \mu) dx$

Suppose $p(x|t^\star)$ was extremely concentrated around his mean $\mu$, then the integral would basically be just $p(x \geq \mu|t^\star)$ which we could roughly estimate to $0.5$ if $p(x)$ was also somewhat symmetric. We would then have the following lower-bound:

$\displaystyle p(x \geq \mu) \geq 0.5 \exp( - D(\mu) )$

Basically, we can make this intuitive argument solid. Noting $\theta = - \log( \int p(x|t^\star) \exp( - t^\star (x - \mu)) 1(x \geq \mu) dx )$, we can prove that:

$\displaystyle \theta \leq p^{-1} t^\star \sqrt(var(x|t)) - \log(p)$

where we have used the shorthand $p = p(x \geq \mu | t^\star )$.

And we can also prove that $p$ is also lower-bounded (in fact, we could this in many different ways: the following bound is super rough but very general). If we note $m_4$ the fourth moment of $p(x| t^\star)$ then:

$\displaystyle p \geq \frac{var(x|t^\star)^2}{4 m_4}$

This is a coarse bound, but it does the job: we can compute both the variance and the fourth moment from the derivatives of either $M(t^\star)$ or $C(t^\star)$.

### Chernoff’s theorem

We can now finally state Chernoff’s theorem.

Thm: Let $X_1, X_2, \dots X_n$ by IID random variables with mean $\mu_0$. Note $D(\mu)$ the dual to their common cumulant generating function. Then $\forall \mu \geq \mu_0$:

$\displaystyle n^{-1} \text{ }\log p( n^{-1} \sum_i X_i \geq \mu) \rightarrow - D(\mu)$

Proof: The cumulant generating function of the sum $C_n(t)$ respects: $C_n(t) = n C(t/n)$, and $t^\star$ also depends on $n$ via: $t_{n}^\star = n t^\star$. Using that with the preceding relationships and some tedious math gives us the theorem (and proves that the error vanishes as $O(n^{-0.5})$ which isn’t the best speed but it’s ok).

### Conclusion

We now understand the Chernoff concentration result. The upper-bound is a bit easy and tells us that a random variable which has a moment-generating function is tightly concentrated around its mean which we completely expected. The lower-bound is much more interesting in that it guarantees that large deviations can happen, and tells us exactly with what rate these deviations will happen. It’s very important to not only know this result but to understand exactly how it was derived because both the upper and lower-bound could be potentially be useful in other situations.

Such concentration inequalities are great for statistics, because they enable us to prove that our estimators are guaranteed not to be too wrong. However, I find that I am slightly unsatisfied by this bound in that regard because there are no cases in which the bound is naturally saturated, which I always find a bit troubling.

### Two remarks

One thing that I’m a bit puzzled with is whether it’s actually the best to compute the lower-bound at $t^\star$. It’s not like we’re forced too. Could we possibly find a tighter bound by looking at some slightly different value of $t$ ?

Second remark: did you notice how it would seem that I’ve written down a proof that, for $\mu \rightarrow -\infty$ we would have $p(x\geq\mu) \rightarrow 0$ when it converges to 1 instead ? Don’t worry: the math is fine. This is simply because I never pointed out where exactly I used that $\mu \geq \mu_0$. Can you figure it out ?

As always, feel free to correct any inaccuracies, errors, spelling mistakes and to send comments on my email ! I’ll be glad to hear from you.

# The natural exponential family of p(x)

Today, I’ll present a neat little trick to construct a family of probability distribution which all look similar to a base distribution $p(x)$ but which all have different means. And it isn’t $p(x-\mu)$ !!

Once that’s done, I’ll show you in another post how to use this neat trick to prove a cool concentration result.

## Definition

Let $p(x)$ be some probability distribution. Let’s assume that “p has a Moment Generating Function (MGF)”: there is some interval around 0 such that:

$\displaystyle M(t) = \int p(x) \exp( t x) dx$

is finite.

Then, we simply define the natural exponential family of $p(x)$ to be all the probability ditributions:

$\displaystyle p(x|t) \propto p(x) \exp(t x)$

for all values of $t$ such that $M(t)$ is finite. So it’s just $p(x)$ but shifted around a little bit.

You already know several examples of natural exponential families: if p is a Gaussian, then its natural exponential family is all the other Gaussians with the same variance. If p is a Gamma distribution, its exponential family is also simple: try and see if you can figure it out.

### Properties

First important property: all members of the family have a different mean. I’ll prove this in the next paragraph.

Second property: all “reachable” means are reached. If $p(x)$ has a bounded support, then all the values inside the support correspond to a member of the family which has that mean. If the support is $\mathbb R$, then all values are reached. I don’t know a proof of that though, so take that with a grain of salt.

As a corollary to this, we can parameterize the family not with $t$ but with $\mu = E(x|t)$: this characterizes a single member of the family !

### Relation to the cumulant generating function (CGF)

Since $p(x)$ has a MGF, it also has a CGF:

$\displaystyle C(t) = \log (M(t))$

Normally, when people talk about the derivatives of the MGF and the CGF, they only talk about the derivatives at $t=0$, but when we talk about the exponential family, we can finally talk about the derivatives at other positions !

The k-th derivative of $C(t)$ at some value of $t$ is, quite simply, the k-th cumulant of $p(x|t)$: one of the members of the exponential family !

Note that it’s not quite as a simple for the MGF $M(t)$: it’s derivatives need to be normalized by $M(t)$ before we obtain a uncentered moment … If this doesn’t speak to you, let’s illustrate it with the second derivative of both: would you rather compute $var(x|t)$ or $M(t) E(x^2|t)$ ?

By looking at C, we can see why each member of the family has a different mean: since $C$ is convex, its derivative is a bijection, and its derivative is: $C'(t) = E(x|t)$. QED.

From the CGF, we can define its Fenchel dual or its convex conjugate $D(\mu)$, which is sometimes called the large-deviations function (which is a bad name if you want my opinion). This dual D is not easy to understand the first time you meet him, but he isn’t that complicated. C is convex, so that its first derivative $C'(t) = E(x|t)$ is a bijection. D is also convex so its first derivative is also a bijection $D'(\mu)$, and it happens to be that $D'$ is exactly the inverse of $C'$ !

Once we are equiped with $D$, we are able to compute which member of the family has a given mean $\mu_0$: it’s simply the guy who has $t = D'(\mu_0)$. The value of $D(\mu)$ is also interesting in itself, because it is equal to $M(t)$: it gives us the value of the normalization.

This duality between C and D is the parallel of the dual parameterizations with $t$ and $\mu$ ! This is why this is cool. When we want to know the properties of $p(x|t)$, we compute things on C. When we want to learn about $p(x|\mu)$, we compute things on D.

### Divergence between members of the family

An important question is: how similar are two given members of the exponential family ? If we use the Kullback-Leibler divergence, it’s very easy to answer ! Try to prove that:

$\displaystyle KL(t_1,t_2) = C(t_2) - C(t_1) - C'(t_1) (t_2 - t_1)$

which is “obviously” positive: a convex function like C is always above its tangent at $t_2$. This isn’t exactly the simplest expression for the divergence, but if we now compute the symmeterized KL divergence, we get:

$\displaystyle KL(t_1,t_2) + KL(t_2,t_2) = [\mu_1-\mu_2][t_1 - t_2]$

As far as I know, this simple expression is absolutely useless, but it looks good, doesn’t it ?

### Natural exponential families and MGF convergence

Since I’m still quite a bit perplexed by topologies on probability distributions, I can’t resist but talk about them in this post.

Recall that MGF convergence is when a sequence of random variables (or of probability distributions) has a corresponding sequence of MGF which converge pointwise. This implies weak-convergence of the sequence towards the probability distribution with the limit MGF, and also convergence of all moments (among other statistics).

But we can now see this implies quite a bit more still. Consider the sequence of exponential families $p_n(x|t)$ (for $t \in [-r,r]$ the region where the MGF converge pointwise). For each value of $t$, we get MGF convergence to the limit function $p(x|t)$. Pretty cool, no ?

## Is it useful ?

The only real use I know for the exponential family is that it’s used for proving a concentration inequality for probability distributions which have a MGF. I’ll write another post about this since this one is getting a bit too long.

I also have some research of my own where this concept plays an important role, but I’m quite a bit behind on writing this down … $SOON^{TM}$

All in all, the natural exponential family is mostly as cool curiosity, but I hope you can find some use for it in your work.

As always, feel free to correct any inaccuracies, errors, spelling mistakes and to send comments on my email ! I’ll be glad to hear from you.

# Struggles with probability topology: strengthening the weak-topology

Today, I continue my series on topologies on probability distributions. This series started when I realized that the common topologies (the weak and total-variation topologies) are extremely weak, in that they don’t imply convergence of moments.

In this series, I mostly looked at alternative topologies: the KL, the MGF, the Wasserstein, etc. Today, I will instead focus on how to strenghten the weak topology: adding conditions that ensure that the limit is better behaved that is should be from just the weak-topology.

## The weak-topology

Let’s first recall how the weak topology works. A random variable sequence $X_n$ converges to $X$ iff the expected value of any bounded continuous statistic $f(x)$ converges:

$\displaystyle E( f(X_n) ) \rightarrow E( f(X) )$

That’s the basic definition but the most useful characterization is that weak convergence is equivalent to convergence of all statistics of the form $\exp(itx)$ for $t \in \mathbb{R}$ (which you will recognize as the Fourier basis).

The weak topology is the most usual topology, but it has what I consider to be confusing behavior. Consider the following example:

$X_n$ is a mixture distribution. With probability $\frac{n-1}{n}$ pick a value from a Gaussian distribution centered at 0 with variance 1; with probability $\frac{1}{n}$, pick instead from a Gaussian with mean n. The mean of $X_n$ is always 1, and it’s variance grows to infinity with $n$. However, $X_n$ converges weakly to a Gaussian centered at 0. Faced with a situation like this, I’d rather say that $X_n \rightarrow Y$ where $Y$ is a degenerate pseudo-Gaussian with mean 1 and infinite variance OR refuse to acknowledge that $X_n$ converges.

### Ensuring some higher convergence

Weird convergence behavior under the weak topology always has the same structure as the example I’m presenting here. All statistics that grow faster than some limit statistic (in this case, everything that grows faster than $|x|$) see their expected value diverge, all statistics growing slower than the limit converge to the correct value. Thus, we can strenghten the weak-convergence of a given sequence by identifying the “critical statistic” of that sequence, or at least by finding statistics which grow slower than the critical statistic.

Let’s state this more precisely. If $X_n \rightarrow X$ weakly, and we know that some statistic is bounded in the sequence: $E( f(X_n) ) \leq K$, then that implies that the expected value of any statistic which is dominated by $f$ converges to the correct value. $\forall g,\text{ } \frac{g}{f} \leq K_g$ then:

$\displaystyle E( g(X_n) ) \rightarrow E( g(X) )$

Why is that the case ? Weird behavior under the weak topology is all about the fact that the weak topology controls poorly the tails of the $X_n$. The condition $E( f(X_n) ) \leq K$ tells us that whatever is happening in the tails is in $O(f^-1)$ so that we are safe for all functions which grow more slowly.

### Ensuring MGF convergence

The strongest topology I have talked about so far is the MGF topology. MGF convergence implies weak convergence, convergence of all moments, and convergence of all statistics that grow slower than some exponential.

With the above condition, we can ensure MGF convergence from weak convergence AND convergence (or boundedness) of $\exp( |r x|)$. We thus have a way to go from our weakest convergence to our strongest convergence in an instant.

### Can opposite violations happen ?

We know that, under the weak topology, sometimes some expected values are infinite when they should be finite: can weird violations also happen the opposite way, ie: $f$ such that $E( f(x) ) = \infty$ but $E( f(X_n)$ remains bounded ?

Thankfully, this can’t happen (for continuous $f$ at least). First, remark that we can restrict ourselves to positive $f(x)$ because for $f$ that aren’t absolutely convergent, I don’t know how to define $E( f(x) )$. Showing that $E( f(X_n) )$ will always be bigger than any bound we set is straighforward. Choose $B$, find a compact domain $I$ such that $E( f(x) 1(x \in I) ) > 2B$. Then $f(x) 1(x \in I)$ is a bounded and continuous statistic. By weak-convergence of $X_n$, we get our result.

## Complete agreement

Let $X_n$ be some weakly convergent sequence, with limit $X$. We saw that the only weird things that could happen was that some statistics that have finite expected value under $X$ have a limit expected value that doesn’t match (very possibly infinite). We can then define the absolute strongest convergence which would be that for any statistic which has finite expected value under $X$ the limit of the expected values converges to the correct value. In that case, for ALL statistics we have that the expected value under $X$ matches the limit of the expected value.

In practice, this complete agreement topology is all about tail behavior so it can probably be formalized in that way. I conjecture that something like:

$\displaystyle \frac{p(x)}{p_n(x)} \leq K_1 \text{ and }\frac{p_n(x)}{p(x)} \leq K_2$

and weak convergence would ensure “complete agreement” convergence. If you’re not convinced that my condition works, consider it for some examples like $p(x) = x^-3$ and $q(x)=\exp( - x)$. I’ll try to prove this in a further post.

### Summary and conclusion

Today, I presented ways to strenghten the weak topology. We found that having weak convergence and boundedness of the expected value of one statistic ensures the correct convergence of all statistics that grow slower. In particular, this enabled us to give a very simple condition to ensure MGF convergence from weak convergence. From this, we also saw that we could define a “complete agreement” topology which would ask that the expected value of any statistic are matched and conjectured that this simply requires weak convergence and similar tail behavior.

#### A technical note

I have avoided throughout this post to refer to equivalence classes of statistics because I was afraid that this might be confusing for those which are encountering this concept for the first time. For readers who understand (or think they understand) concepts like small-o, big-O and big-theta, try to rephrase what I said earlier with those terms.

As always, feel free to correct any inaccuracies, errors, spelling mistakes and to send comments on my email ! I’ll be glad to hear from you.