Reading: “All of statistics” (Wasserman) a second time

I’ve spent almost this whole week reading, and now my head hurts. (However, since correlation is not causation, I won’t conclude that there is a causal relationship between the two.)

My latest read was the classic “All of statistics” by Wasserman, which is a great statistics book. Its objective is to give a compact introduction to all (or most) of statistics in a 250-page book, and it does so spectacularly well. If you want to understand stats, and have a good understanding of probability and analysis, I honestly can’t recommend “All of statistics” enough. This is a particularly good book for computer science / machine learning students, since they might have picked up some stats concepts “on the job” but never had a formal presentation of the full framework, and they likely have all of the background knowledge.

Of course, in order to make the book compact, some things had to be cut: proofs and examples. The book offers proofs of only the most important theorems, the rest being left as exercises for the reader. There are a few examples, but probably far fewer than in most books. However, and this might be a very personal opinion, I didn’t mind at all: proofs are important, but they also distract from the flow of learning about a new field such as statistics. One can always come back to a more detailed reference book if one really needs to know all the details.

In a nutshell: “All of statistics” is a great book which I warmly recommend. The only drawback is that Prof. Wasserman doesn’t give the most honest presentation of Bayesian inference (which, everybody knows, is the best statistical paradigm!).


Reading: “Teaching statistics” (Gelman & Nolan)

I got some more reading done in the last few days. This time I was reading “Teaching statistics: a bag of tricks” by A. Gelman and D. Nolan. The book aims at giving lots of examples and ideas to make an introductory statistics lecture better. It also has chapters at the end that give ideas for some more advanced classes. Critically, they have one on teaching Bayesian inference, which I’ll come back to when I’m preparing that class.


First, let me tell you that this book is really good. Gelman and Nolan are clearly very good teachers and they do a really good job of sharing both the very little and the very big pieces of knowledge that they have acquired over the years. The book mostly focuses on activities / examples that can be used to engage with students. The objective is to make statistics more accessible so that the students retain more and get a more intuitive understanding of what statistics is about.

I’m guessing that, to the more experienced teacher, this book might feel light, as they would have already had some (most?) of the ideas that are presented. However, even if you have already had the intuitions, having them laid out clearly by Gelman and Nolan would still be beneficial in my opinion.


The one issue I have with the book is that, because the authors provide such high-quality courses to their students, it seems a little hard for me to adapt their advice to my class, in which I do not have the resources for student projects, programming assignments, etc. I think it makes me a little bit jealous: I’d love to be able to do all of that! I’m going to need to think hard about how to still convey to my students some of the “softer” aspects of statistics: how to collect data in a “right” way, how to correctly summarize information in a graph, etc.

Reading: “Statistique pour mathématiciens” (V. Panaretos)

I’m going to be using the summer break to get some reading done. I’ll do one or more posts on every book I read so that I can share a little bit of what I learn.

Today, I’ll discuss “Statistique pour mathématiciens, un premier cours rigoureux”, the textbook of a second-year Bachelor course for mathematicians at EPFL. I’m reading it because I’m teaching the same course, but for engineers, next fall, and I’m hoping to find some good ideas about how to explain statistics.

As I write this post, I have read about half of the book (which is fairly short: 240 pages, including a 100-page-long appendix with proofs and exercise corrections). However, it is extremely dense and full of content. It reminds me of “All of statistics” by Wasserman in that this book could be a very good first dive into statistics for someone who feels comfortable with probability theory but who has no knowledge of statistics. Of course, given that it is much shorter than “All of statistics”, it doesn’t have the same depth or breadth, but this book has all the essential points of statistics.

One section which I particularly liked was the introduction. It seems to me that all of the major difficulties of statistics are conceptual: once you understand what you are trying to accomplish, and the way you should frame your questions, the math part of statistics is (should be?) straightforward compared to other classes. However, these conceptual difficulties are either glossed over or very poorly presented. Panaretos does a great job of explaining exactly what statistics is and how it proceeds.

One particularly good illustration of this is when he describes how one should choose a model to describe a dataset: through a handful of clear and (at least for engineers) familiar examples, he shows how scientific, philosophical, or exploratory approaches can be used and combined to formulate an appropriate model.


I look forward to finishing this very nice book!

The Median is asymptotically Gaussian

The other day, I came across a really basic fact, which I should have known a long time ago: the empirical median is approximately distributed as a Gaussian in the large-data limit! Let me share the proof with you: hopefully this will make me remember both the proof and the theorem in the future.


Let’s start by stating the theorem. Let X_i be 2n+1 IID random variables with CDF F, and let \hat{m} be their empirical median (note that since we have an odd number of datapoints, the empirical median is straightforward to find: just order the X_i and take the value at index n+1 in the ordered list). The theorem states that, in the limit of a large dataset n \rightarrow \infty, the empirical median has an approximately Gaussian distribution. Even better: we know its exact distribution for any n. It is a Beta random variable B with parameters (\alpha = n+1, \beta = n+1), deformed by the inverse of the CDF of the X_i (the equality below holds in distribution):

\displaystyle \widehat{m} = F^{-1} ( B ), \quad B \sim \text{Beta}(n+1, n+1)


In order to prove this result, let’s first focus on the case in which the X_i are uniform random variables on [0,1]. Let’s compute the probability density p(\widehat{m} = x). We have 2n+1 equally likely possibilities for the index of the median. Thus, we can focus on the case in which X_1 is the median and multiply by 2n+1:

\displaystyle p(\widehat{m} = x) = (2n+1) \, p(\widehat{m} = X_1 = x)

We then need to worry about the other indexes. n points need to be above x and n need to be below. This gives \binom{2n}{n} possible arrangements. We can once more focus on a single possibility: X_2 \dots X_{n+1} are below x and X_{n+2} \dots X_{2n+1} are above.

\displaystyle p(\widehat{m} = x) = (2n+1)\binom{2n}{n} p(\widehat{m} = X_1 = x \text{ AND } X_2 \dots X_{n+1} < x \text{ AND } X_{n+2} \dots X_{2n+1} > x)

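Finally, all the X_i are independent random variables. The probability above is thus straightforward to compute as a product, and for uniform variables on [0,1] we have p(X_1 = x) = 1, p(X_i < x) = x and p(X_j > x) = 1-x: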

\displaystyle p(\widehat{m} = x) =(2n+1)\binom{2n}{n}  p(X_1 = x) \prod_{i=2}^{n+1} p(X_i < x) \prod_{j=n+2}^{2n+1} p(X_j>x)

\displaystyle p(\widehat{m} = x) =(2n+1) \binom{2n}{n}  (1-x)^n x^n

\displaystyle p(\widehat{m} = x) = \frac{(2n+1)!}{n! n!}  (1-x)^n x^n

which we recognize as the density of a Beta distribution with parameters (\alpha = n+1, \beta = n+1).
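
The result for the uniform case can be checked numerically. Here is a minimal sketch, assuming NumPy is available; the function name, the seed, and the choices n = 10 and 100,000 trials are mine, purely for illustration. It compares simulated medians of 2n+1 uniform variables with direct draws from a Beta(n+1, n+1):

```python
import numpy as np

def simulate_uniform_medians(n, n_trials, rng):
    """Return the empirical median of 2n+1 uniform draws, repeated n_trials times."""
    samples = rng.uniform(size=(n_trials, 2 * n + 1))
    # With an odd number of points, np.median returns the middle value of the
    # ordered list, i.e. the value at (1-based) index n+1.
    return np.median(samples, axis=1)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 10                          # illustrative choice: 2n+1 = 21 datapoints
n_trials = 100_000

medians = simulate_uniform_medians(n, n_trials, rng)
beta_draws = rng.beta(n + 1, n + 1, size=n_trials)

# Mean and variance of a Beta(a, b): a/(a+b) and ab/((a+b)^2 (a+b+1)).
a = b = n + 1
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))

levels = [0.05, 0.25, 0.5, 0.75, 0.95]
print("simulated mean, var:", medians.mean(), medians.var())
print("Beta mean, var:     ", beta_mean, beta_var)
print("median quantiles:   ", np.quantile(medians, levels))
print("Beta quantiles:     ", np.quantile(beta_draws, levels))
```

If the derivation is right, the two sets of moments and quantiles should agree up to simulation noise.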


Ok. Now we know what happens for uniform distributions. How can we extend that result to the general case? In order to do so, we have to remember that any random variable can be constructed from a uniform distribution using the inverse of its CDF. Thus, we can construct the X_i as:

\displaystyle X_i = F^{-1}(U_i)

Furthermore, the function F^{-1} is monotone (non-decreasing). Thus, the median of the X_i is the image of the median of the U_i. Since the median of the U_i has a Beta distribution, this means that the median of the X_i has a deformed Beta distribution:

\displaystyle \hat{m} = F^{-1} ( B ), \quad B \sim \text{Beta}(n+1, n+1)


You might notice that we haven’t talked about Gaussians yet, but we’re almost there. Beta distributions become Gaussian with variance tending to 0 in the limit \alpha \rightarrow \infty, \beta \rightarrow \infty while the ratio \alpha / \beta stays constant. This is what happens here when n \rightarrow \infty, since \alpha = \beta = n+1. We have thus proved that in the uniform case, the median becomes Gaussian. Furthermore, because the variance of the Beta goes to 0 as n grows, the Beta variable concentrates around 1/2, a region over which the non-linear function F^{-1} is close to linear (provided F^{-1} is differentiable at 1/2, i.e. the X_i have a positive density at their median). A linear function of a Gaussian is Gaussian, so the median becomes close to Gaussian in the general case too.
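
The general case can also be checked by simulation. The sketch below is only an illustration under my own assumptions: unit-rate exponential data (so F^{-1}(u) = -log(1-u)), n = 200, and the standard large-sample variance of the median, 1 / (4 N f(m)^2) with N = 2n+1 and f(m) the density at the true median. That variance formula is not derived in this post; it follows from linearizing F^{-1} around 1/2.

```python
import numpy as np

def exponential_inverse_cdf(u):
    """Inverse CDF of the unit-rate exponential: F^{-1}(u) = -log(1 - u)."""
    return -np.log1p(-u)

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
n = 200                         # illustrative choice
n_points = 2 * n + 1
n_trials = 10_000

# Build the X_i from uniforms through the inverse CDF, as in the proof.
uniforms = rng.uniform(size=(n_trials, n_points))
medians = np.median(exponential_inverse_cdf(uniforms), axis=1)

true_median = np.log(2.0)  # F(m) = 1/2 for the unit-rate exponential
density_at_median = 0.5    # f(m) = exp(-log 2)
# Assumed large-sample variance of the median: 1 / (4 N f(m)^2).
approx_var = 1.0 / (4 * n_points * density_at_median ** 2)

# If the median is close to Gaussian, about 95% of the standardized
# medians should fall within +/- 1.96.
z = (medians - true_median) / np.sqrt(approx_var)
coverage = np.mean(np.abs(z) < 1.96)

print("simulated mean, var:", medians.mean(), medians.var())
print("Gaussian approx:    ", true_median, approx_var)
print("fraction within 1.96 standard deviations:", coverage)
```

Rerunning with a small n (say n = 2) is a quick way to see how far the median is from Gaussian before the limit kicks in.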


Note that this property holds in general for all empirical quantiles of IID datapoints. Would you be able to prove it? You just have to slightly modify the steps which I have presented here.

Lessons from my first course (2)

Three months later, I’m finally coming back to this blog to finish gathering my thoughts on my first class. And since I took such a long break from my blog, the spring semester is now over, meaning that I can now also reflect on my second and third classes (which went better I think?)

In my preceding post, I tried to highlight what went right and wrong. Now, I’ll try to understand what I can do to make my next classes better by analyzing why things went the way they did.

One big issue during the first class was that I ended the semester extremely tired. This happened once more during this spring semester. Part of it is understandable: these are my first courses, I have a lot to learn, and so preparing each class takes me a lot of time. Part of the solution is thus going to come from my summer resolution: finishing August with all my material ready for the classes of the autumn semester, and having the material for the spring classes ready by the end of January. Hopefully, I can then spend only half a day each week preparing for each class, which should make me less stressed and less worn out when the end of the semester comes.

The second big issue was that I was blindsided by the fact that the students didn’t like the class. That’s pretty easy to fix, however. During this semester, I twice asked the student representatives to gather feedback from their peers so that I could get a better picture of how they felt. I’ll try to up this to three times for my next classes: once per month. I think this is very important since I feel that students are shy about expressing the problems they have with a class. They are willing to give feedback: I just need to ask them for it. Hopefully, this will be unbiased feedback. My biggest worry is that they won’t be honest with me. I’ll try to be careful here.

On top of this, I feel like I have learned a little about good teaching practices. I’ll take the time to reflect on that in another blog post. Armed with this new knowledge, I hope that my classes next year can be better!

Lessons from my first course

The exam session for the autumn semester is over. This seems like a fine time to reflect on my first course.

This first course consisted of teaching second year bachelor students, in non-mathematical studies, an introduction to probability and statistics. The content of the class was essentially probability up to the central limit theorem and an introduction to the concept of statistics, with mostly Gaussian models. The hardest statistical topic was the Student t-test for linear regression.

Overall, I think I did a barely passing job for this course. It is of course understandable that not everything can go right for a first course, but that doesn’t mean that I shouldn’t be honest with myself, my students and the rest of the department. Let’s try to list what went wrong and what went OK.

Things that went well:

  • I was a very motivated teacher, and I brought much more energy to my class than most teachers
  • I was very available to my students
  • I remained (somewhat; you’ll see in a second) attentive to how the course was progressing, and I think that I improved my teaching quite a lot over the semester
  • I spoke well
  • I didn’t write as poorly as I expected of myself. I still need to work on this point quite a lot


Now, what didn’t go well:

  • I was too ambitious for a first course. I redid everything from scratch, when I shouldn’t have. This caused me to make many mistakes, namely:
  • I didn’t take into account enough what students need. They need a lot of structure: every piece of knowledge should clearly be labeled according to how important it is to understand, etc.
  • I crafted a course that I would have liked as a student. This is a bad idea since I was always a very mathematically-minded student, and I was a very good student. My course should instead be aimed at a more practical level, and at a slower pace.
  • I completely neglected exercises. They are an integral part of the learning experience and are at least as important as the lectures. This caused students to resent the exercise sessions, and minimized how much they learnt from the class.
  • Even though I identified some of these flaws along the way, most of them completely blindsided me, and only surfaced during the anonymous review of the course by the students. This means I failed to gather meaningful feedback from the class. These issues should have been identified much earlier during the semester, which would have enabled me to correct them much earlier.
  • I handled the pressure of teaching pretty poorly, especially at the end of the semester. I’m a very anxious guy, so it’s not surprising that I was stressed, but this went beyond stress. Teaching a semester is a bit like running a marathon: you can’t give all you have during the first half and finish the race crawling, you have to be regular. I need to pace my energy more in the future.


In the next few days, I’ll try to reflect on identifying why things went wrong (and right), and what I will do to make my future classes better.

NIPS workshops day 2

Here is what I learned on the second day of the NIPS workshops, where I went to the deep Bayesian networks workshop.

I feel like I should take a second to define precisely what the workshop is about. In a nutshell, it’s about trying to combine Bayesian methods and deep neural networks. On paper, it seems like a great idea to augment Bayesian methods with the flexibility of neural networks. However, it is a path that really emphasizes the key weakness of Bayesian methods: the fact that they are riddled with computational problems.
The initial idea that people have been using to deal with the computational problems is “variational inference” (minimizing the “reverse” Kullback-Leibler divergence KL(q,p)).
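
For reference, here is the usual definition of that divergence, written as a sketch in which q denotes the approximating distribution and p the target posterior over parameters \theta (the symbols are assumptions chosen for illustration, not notation from the talks):

\displaystyle KL(q,p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta)} d\theta

The expectation is taken under the approximation q rather than under the target p, which is why this direction is called the “reverse” one.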

The day had two highlights. The first was a talk by Zoubin Ghahramani, who gave a nice history lesson on Bayesian neural networks. We do tend to get a little bit caught up in what we are doing, and it’s great to have these talks from time to time to remember the giants whose shoulders we are standing on. The second highlight was a tribute by Ryan Adams to one of those giants, David MacKay, who passed away earlier in the year. I didn’t know Prof. MacKay, but this was a very moving talk and it painted a very vivid picture of him. He seemed like a great guy, and an even greater scientist. He will be missed.

An intriguing idea (which was presented several times during the whole conference) was entitled “Stein variational inference”. It consists in finding a cloud of points to approximate a target probability distribution according to an objective that is reminiscent of KL(q,p). I’m not sure how much this differs from using a sparse kernel-based approximation of the log probability of the target distribution. They also had a deep network method that was reminiscent of generative adversarial networks.
This has a lot of interesting flavor, with its combination of Stein’s method, variational inference and kernel methods, so I definitely need to look at it further.

At this point, I was pretty saturated so I just couldn’t follow anymore, but the panel discussion which closed the workshop was pretty great. I’m guessing that these panels are growing on me after all. I really didn’t like them last year at NIPS, nor the few that were in the Cosyne workshops (I remember one at Cosyne that was particularly unproductive). It really depends on the panel and the audience’s comments… but it can be great. Overall, I liked the first day of the workshops more, but I’m guessing that, quite simply, that workshop aligned a bit more with my interests than today’s one. It was still great though.