The Median is asymptotically Gaussian

The other day, I came across a really basic fact, which I should have known a long time ago: the empirical median is approximately distributed as a Gaussian in the large-data limit ! Let me share the proof with you: hopefully this will make me remember both the proof and the theorem in the future.

Let’s start by stating the theorem. Let $X_i$ be $2n+1$ IID random variables, and let $\hat{m}$ be their empirical median (note that since we have an odd number of datapoints, the empirical is straightforward to find: just order the $X_i$ and find the value at index $n+1$ in the ordered list). The theorem states that, in the limit of a large dataset $n \rightarrow \infty$, the empirical median has a Gaussian distribution. Even better: we can know exactly what kind of random variable it is: it corresponds to a Beta distribution with parameters $(\alpha = n+1, \beta = n+1)$ deformed by the inverse of the CDF of the $X_i$:

$\displaystyle \widehat{m} = F^{-1} ( \beta )$

In order to prove this result, let’s first focus on the case in which the $X_i$ are uniform random variables. Let’s compute the probability $P(\widehat{m} = x)$. We have $2n+1$ possibilities for the index of the median. Thus, we can focus on the case in which $X_1$ is the median:

$\displaystyle p(\widehat{m} = x) = p(\widehat{m} = X_1 = x)$

We then need to worry about the other indexes. $n$ points need to be above $x$ and $n$ need to below. This gives $\binom{2n}{n}$ possible repartitions. We can once more focus on a single possibility: $X_2 \dots X_{n+1}$ are below $x$ and $X_{n+2} \dots X_{2n+1}$ are above.

$\displaystyle p(\widehat{m} = x) = (2n+1)\binom{2n}{n} p(\widehat{m} = X_1 = x \text{ AND } X_2 \dots X_{n+1} < x \text{ AND } X_{n+2} \dots X_{2n+1} > x)$

Finally, all the $X_i$ are IID independent random variables. The probability above is thus straightforward to compute:

$\displaystyle p(\widehat{m} = x) =(2n+1)\binom{2n}{n} p(X_1 = x) \prod_{i=2}^{n+1} p(X_i < x) \prod_{j=n+2}^{2n+1} p(X_j>x)$

$\displaystyle p(\widehat{m} = x) =(2n+1) \binom{2n}{n} (1-x)^n x^n$

$\displaystyle p(\widehat{m} = x) = \frac{(2n+1)!}{n! n!} (1-x)^n x^n$

which we recognize as a beta distribution with parameters $(\alpha = n+1, \beta = n+1)$.

Ok. Now we know what happens for uniform distributions. How can we extend that result to the general case? In order to do so, we have to remember that any random variable can be constructed from a uniform distribution using the inverse of its CDF. Thus, we can construct the $X_i$ as:

$\displaystyle X_i = F^{-1}(U_i)$

Furthermore, the function $F^{-1}$ is monotonous. Thus, the median of the $X_i$ is the image of the median of $U_i$. Since the median of the $U_i$ has a Beta distribution, this means that the median of the $X_i$ has a deformed Beta distribution:

$\displaystyle \hat{m} = F^{-1} ( \beta )$

You might notice that we haven’t talked about Gaussians yet, but we’re almost there. Beta distributions become Gaussian with variance tending to 0 in the limit $\alpha \rightarrow \infty, \beta \rightarrow \infty$ while the ratio $\alpha / \beta$ stays constant. This is what happens here when $n \rightarrow \infty$. We have thus proved that in the uniform case, the median becomes Gaussian. Furthermore, because the variance of the Beta also goes to 0 as $n$ grows, the non-linear function $F^{-1}$ becomes close to linear and the median becomes close to Gaussian.

Note that this property holds in general for all empirical quantiles of IID datapoints. Would you be able to prove it? You just have to slightly modify the steps which I have presented here.

Lessons from my first course (2)

Three months later, I’m finally coming back to this blog to finish gathering my thoughts on my first class. And since I took such a long break from my blog, the spring semester is now over, meaning that I can now also reflect on my second and third classes (which went better I think?)

In my preceding post, I tried to highlight what went right and wrong. Now, I’ll try to understand what I can do to make my next classes better by analyzing why things went the way they did.

One big issue during the first class was that I ended the semester being extremely tired. This happened once more during this spring semester. Part of it is understandable: these are my first courses, and I have a lot to learn and so preparing each class takes me a lot of time. Part of the solution is thus going to come from my summer resolution: finishing august with all my material ready for the classes of the autumn semester, and having the material for the spring classes ready by the end of january. Hopefully, I can then use only half a day each week preparing for each class. Hopefully, this will make me less stressed and less worn out when the end of the semester comes.

The second big issue was that I was blindsided by the fact that the students didn’t like the class. That’s pretty easy to fix, however. What I did during this semester was that I asked twice the student representatives to gather feedback from their peers so that I could get a better picture of how they felt. I’ll try to up this to three times for my next classes: once per month. I think this is very important since I feel that students are shy in expressing the problems they have with a class. They are willing to give feedback: I just need to ask it from them. Hopefully, this will be unbiased feedback. My biggest worry is that they won’t be honest with me. I’ll try to be careful here.

On top of this, I feel like I have learned a little about good teaching practices. I’ll take the time to reflect on that in another blog post. Armed with this new knowledge, I hope that my classes next year can be better!

Lessons from my first course

Exam session for the autumn semester is over. This seems like a fine time to reflect over my first course.

This first course consisted of teaching second year bachelor students, in non-mathematical studies, an introduction to probability and statistics. The content of the class was essentially probability up to the central limit theorem and an introduction to the concept of statistics, with mostly Gaussian models. The hardest statistical topic was the Student t-test for linear regression.

Overall, I think I did a barely passing job for this course. It is of course understandable that not everything can go alright for a first course, but that doesn’t mean that I shouldn’t be honest with myself, my students and the rest of the department. Let’s try to list what went wrong and what went OK.

Things that went well:

• I was a very motivated teacher, and I brought much more energy to my class than most teachers
• I was very available to my students
• I remained (somewhat; you’ll see in a second) attentive to how the course was progressing, and I think that I improved my teaching quite a lot over the semester
• I spoke well
• I didn’t write as poorly as I expected of myself. I still need to work on this point quite a lot

Now, what didn’t go well:

• I was too ambitious for a first course. I redid everything from scratch, when I shouldn’t have. This caused me to commit many mistakes. Namely,
• I didn’t take into account enough what students need. They need a lot of structure: every piece of knowledge should clearly be labeled according to how important it is to understand, etc.
• I crafted a course that I would have liked as a student. This is a bad idea since I was always a very mathematically-minded student, and I was a very good student. My course should be aimed instead at a more practical-level, and towards a slower-pace.
• I completely neglected exercises. They are an integral part of the learning experience and are at least as important as the lectures. This caused students to resent the exercise sessions, and minimized how much they learnt from the class.
• Even though I identified some of these flaws along the way, most of them completely blindsinded me, and only came to surface during the anonymous review of the course by the students. This means I failed to gather meaningful feedback from the class. These issues should have been identified much earlier during the semester, which would have enabled me to correct them much earlier.
• I handled the pressure of teaching pretty poorly, especially at the end of the semester. I’m a very anxious guy, so it’s not surprising that I was stressed, but this went beyond stress. Teaching a semester is a bit like running a marathon: you can’t give all you have during the first half and finish the race crawling, you have to be regular. I need to pace my energy more in the future.

In the next few days, I’ll try to reflect on identifying why things went wrong (and right), and what I will do to make my future classes better.

Nips workshops day 2

Here is what I learned on the second day of the nips workshops where I went to the deep bayesian networks workshops.

I feel like I should take a second to define precisely what the workshop is about. In a nutshell, it’s about trying to combine Bayesian methods and deep neural networks. On paper this seems like a great idea to augment Bayesian methods with the flexibility of neural networks. However, it is a path that really emphasizes the key weakness of Bayesian methods: the fact that it is riddled with computational problems.
The initial idea that people have been using to deal with the computational problems is “variational inference” (minimizing the “reverse” Kullback-Leibler divergence $KL(q,p)$)

The two highlights of the day were a first talk by Zoubin Ghahramani who gave a nice history lesson on Bayesian neural networks. We do tend to get a little bit caught up in what we are doing, and it’s great to have these talks from time to time to remember the giants whose shoulders we are standing on. The second highlight was a tribute to one of those giants, David Mackay who passed away earlier during the year, by Ryan Adams. I didn’t know Prof. Mackay, but this was a very moving talk and it painted a very vivid picture of him. He seemed like a great guy, and an even greater scientist. He will be missed.

An intriguing idea (which was presented several times during the whole conference) was entitled “Stein variational inference”. It consists in finding a cloud of points to approximate a target probability distribution according to an objective that is reminescent of $KL(q,p)$. I’m not sure how much this differs from using a sparse kernel-based approximation of the log probability of the target distribution. They also had a deep network method that was reminescent of generative adverserial networks.
This has a lot of interesting flavor with the combination of Stein’s method and variational inference and kernel-methods so I definitely need to look at it further

At this point, I was pretty saturated so I just couldn’t follow anymore, but the panel discussion which closed the workshop was pretty great. I’m guessing that these panels are growing on me after all. I really didn’t like them last year at nips, as well as the few that were in the cosyne workshops (I remember one at cosyne that was particularly unproductive). It really depends on the panel and the public’s comment… but it can be great. Overall, I liked the first day of the workshops more, but I’m guessing that, quite simply, that workshop simply aligned a bit more with my interests than today’s one. It was still great though.

Nips workshops day 1

Here is what I learned during the first day of workshops: at the approximate bayesian inference workshop !

The day started by a very cool idea by Surya Ganguli, which showed how he could train a deep network to learn how to reverse the flow of time! It’s actually a bit less impressive than it sounds, but stil super cool. He was looking at how to model some data set in an unsupervised fashion. His idea was the following. You first design a dynamical process that will decay each data point into white noise (or whatever the appropriate equivalent of that is) and then training a deep-net to reverse the flow of time: take as input a sample from white-noise and return a prediction for the corresponding data point. He then gets machine that can approximately sample from the data distribution ! that was great.

I then gave a talk about expectation-propagation and how it can be viewed as a variant of gradient descent, shedding quite a bit of light on this method. I need to write a blog-post on that.

Closing the morning session was a pannel on the computational aspects of approximate inference methods. Pretty interesting. I really liked the fact that the panel was pretty open to questions from the audience. I like it a bit less when panels are very structured: in most cases, I don’t think this leads to very lively discussions.

During the lunch break, there was an awesome number of super cool posters (where were they during the poster sesssions of the main conference!?) which I’ll come back to in a future post if I have time.  People have so many super-cool ideas, and actually make them work! I love the energy-feeling you get from places like nips.

In the afternoon, we had a great statistical talk by Jeffrey Regier. They managed to run a model to infer the position, angle, color, etc of all stars and galaxies in the night sky (500TB). That. Was. Cool. His model had a large number of variables, interacting in interesting ways, and was trying to model a giant image of the sky. This was way cooler than my description of it.

Closing the day was a panel on the foundations and futures of variational inference. It was pretty interesting, but it was very directed. Richard Turner had a pretty great recap on the advantages of expectation propagation (oh yeah!), Philipp Hennig had a great introduction to probabilistic numerical methods. When probed about the combination of deep nets and bayesian inference, Ryan Adams said that he was hyped but that he also liked probabilistic graphical models and that he wants to combine so that both are playing to their strengths. The final panel made a great point about the fact that, for variational inference, we optimize the posterior without considering what we do afterwards with the approximate posterior / what task we solve. This is a great point, though I’m not sure how to find answers to that. Indeed, if my loss function is $(\theta - \theta_0)^8$, I’m sure that my approximation algorithm should take this into account, but how? Puzzling …

Nips day 3

Here is what I learned during the third day of nips.

The day started by a very very cool talk by Kyle Cranmer from CERN. He did a great job explaining how physicists performed the statistical analysis behind the discovery of the Higgs boson. His talk went in great detail but remained extremely easy to understand and was absolutely great (probably my favorite of the whole conference). He then also had a few great proposals for interesting problems that the community might be interested in modeling.

In the afternoon, there was a very entertaining talk by Marc Raibert from the boston dynamics company. They do the wonderful robots that you probably have seen on youtube. I was a little bit sad that he didn’t give us too much insight into how they actually make the robots move but he did show a lot and had lots of entertaining videos so he gets a shout-out.

Another cool idea in the afternoon, that was a lot more focused on practical theory, was presented by Damien Scieur, Alexandre d’Aspremont and Francis Bach. They showed a post-processing method for optimizing systems. Their method is a very small and cheap trick that you can plugin to any optimization method to make it better. Overall, it was pretty cool, but I’m not sure if I understand optimization well enough to give the best commentary on that.

Nips day 2

This is the second day of nips. Here is what I learned.

In the morning, Matthias Seeger (Amazon) presented the model that they use to forecast demand. That was very interesting. That was basically a linear regression with handcrafted features (he didn’t go in detail there about how they were crafted). The regression itself was also interesting: they want to regress the number of sales in the day to predictors; they did so by having two logistic regression for determining whether you would get 1 sale or more, 2 sales or more, and then a Poisson likelihood for all sales higher than that. This makes the system able to model the fact that the first few values are much higher than they should be in a Poisson model, while modeling the tail behavior in a simple way.
Another interesting bit: they have periods where products aren’t in stock, so they can’t be sold. If they ignore these periods, their model goes wrong (it under-estimates the desirability of the product). If they model these correctly: they are no sales because the product is out-of-stock, then their model is way better. As always, the generative model should match the physical realities of the data.

In the afternoon, there was also a great talk by Saket Navlakha from the Salk institute. I’m normally not a great fan of bio-mimetics approach (trying to use biology to improve machine learning techniques, or, more generally, engineering technology; in most cases, the constraints we deal with as engineers are extremely different from the constraints that biology is dealing with) but he had two great examples of exactly that. The first one consisted of creating a communication network by removing edges from a fully-connected (or just very dense) network.

Final great thing was a very particular mixture of Bayesian methods and deep learning by Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet and Ole Winther. They wanted to model sequential data (recordings of speech). A good idea for a model would use a latent markov-chain, conditional on which the sound at each time step is independent (a hidden-markov-chain model). Normally, people would use a very simple latent model: a Kalman filter. This doesn’t work well at all as it as too simple dynamics compared to the actual dynamics of the signal being modelled. What Marco and co-authors propose to use instetad is to use a deep-network to represent the dynamics of the latent space oO. They then are able to perform inference on the weights of their deep network !!?! and finally they learn to use an inference network to take a soundwave as an input and return a variational inference approximation of the posterior on the latent space !!!?!??!!?! This sounds insane, but it actually works and works very well.
I love these extremely creative combinations of deep-learning with bayesian inference: there was a lot of it at nips this year and I’m very excited to see where it goes in the future.