# Reading: “Teaching statistics” (Gelman & Nolan)

I got some more reading done in the last few days. This time I was reading “Teaching statistics: a bag a tricks” by A. Gelman and D. Nolan. The book aims at giving lots of examples and ideas to make an introductory statistics lecture better (and it also has chapters at the end that give ideas for some more advanced classes. Critically, they have one on teaching Bayesian inference, which I’ll come back to when I’m preparing that class).

First, let me tell you that this book is really good. Gelman and Nolan are clearly very good teachers and they make a really good job of sharing both the very little and the very big pieces of knowledge that they have acquired over the years. The book mostly focuses on activities / examples that can be used to engage with students. The objective is to make statistics more accessible so that the students retain more and get a more intuitive understanding of what statistics is about.

I’m guessing that, to the more experienced teacher, this book might feel light, as they would have already had some (most?) of the ideas that are presented. However, even if you have already had the intuitions, having them laid out clearly by Gelman and Nolan would still be beneficial in my opinion.

The on issue I have with the book is that, because the authors provide such high-quality courses to their students, it seems a little hard for me to adapt their advice to my class in which I do not have the resources for student projects, programming assignments, etc. I think it makes me a little bit jealous: I’d love to be able to do all of that ! I’m going to need to think hard about how to still convey to my students some of the “softer” aspects of statistics: how to collect data in a “right” way, how to correctly summarize information in a graph, etc.

# Reading: “Statistique pour mathématiciens” (V. Panaretos)

I’m going to be using the summer break to get some reading done. I’ll do one or more post on every book I read so that I can share a little bit of what I learn.

Today, I’ll discuss “Statistique pour mathématiciens, un premier cours rigoureux” which is a second year Bachelor course aimed at mathematicians which is used in EPFL. I’m reading it because I’m teaching the same course but for engineers next fall, and I’m hoping to find some good ideas about how to explain statistics.

As I write the post, I have read about half of the book (which is fairly short: 240 pages, including a 100 page-long appendix with proofs and exercise corrections). However, it is extremely dense and full of content. It reminds me of “All of statistics” by Wasserman in that this book could be a very good first dive into statistics for someone who feels comfortable with probability theory but who has no knowledge of statistics. Of course, given that it is much shorter that “All of statistics”, it doesn’t have the same depth or breadth, but this book has all the essential points of statistics.

One section which I particularly liked was the introduction. It seems to me that all of the major difficulties of statistics are conceptual: once you understand what you are trying to accomplish, and the way you should frame your questions, the math part of statistics is (should be?) straightforward compared to other classes. However, these conceptual difficulties are either glossed over or very poorly presented. Panaretos does a great job of explaining exactly what statistics is and how it proceeds.

One particularly great example of this is when he describes how one should choose a model to describe a dataset: through a handful of clear and (at least for engineers) familiar examples, he gives great examples of how scientific, philosophical, or exploratory approaches can be used and combined to formulate an appropriate model.

I look forward to finishing reading this very nice book !

# The Median is asymptotically Gaussian

The other day, I came across a really basic fact, which I should have known a long time ago: the empirical median is approximately distributed as a Gaussian in the large-data limit ! Let me share the proof with you: hopefully this will make me remember both the proof and the theorem in the future.

Let’s start by stating the theorem. Let $X_i$ be $2n+1$ IID random variables, and let $\hat{m}$ be their empirical median (note that since we have an odd number of datapoints, the empirical is straightforward to find: just order the $X_i$ and find the value at index $n+1$ in the ordered list). The theorem states that, in the limit of a large dataset $n \rightarrow \infty$, the empirical median has a Gaussian distribution. Even better: we can know exactly what kind of random variable it is: it corresponds to a Beta distribution with parameters $(\alpha = n+1, \beta = n+1)$ deformed by the inverse of the CDF of the $X_i$:

$\displaystyle \widehat{m} = F^{-1} ( \beta )$

In order to prove this result, let’s first focus on the case in which the $X_i$ are uniform random variables. Let’s compute the probability $P(\widehat{m} = x)$. We have $2n+1$ possibilities for the index of the median. Thus, we can focus on the case in which $X_1$ is the median:

$\displaystyle p(\widehat{m} = x) = p(\widehat{m} = X_1 = x)$

We then need to worry about the other indexes. $n$ points need to be above $x$ and $n$ need to below. This gives $\binom{2n}{n}$ possible repartitions. We can once more focus on a single possibility: $X_2 \dots X_{n+1}$ are below $x$ and $X_{n+2} \dots X_{2n+1}$ are above.

$\displaystyle p(\widehat{m} = x) = (2n+1)\binom{2n}{n} p(\widehat{m} = X_1 = x \text{ AND } X_2 \dots X_{n+1} < x \text{ AND } X_{n+2} \dots X_{2n+1} > x)$

Finally, all the $X_i$ are IID independent random variables. The probability above is thus straightforward to compute:

$\displaystyle p(\widehat{m} = x) =(2n+1)\binom{2n}{n} p(X_1 = x) \prod_{i=2}^{n+1} p(X_i < x) \prod_{j=n+2}^{2n+1} p(X_j>x)$

$\displaystyle p(\widehat{m} = x) =(2n+1) \binom{2n}{n} (1-x)^n x^n$

$\displaystyle p(\widehat{m} = x) = \frac{(2n+1)!}{n! n!} (1-x)^n x^n$

which we recognize as a beta distribution with parameters $(\alpha = n+1, \beta = n+1)$.

Ok. Now we know what happens for uniform distributions. How can we extend that result to the general case? In order to do so, we have to remember that any random variable can be constructed from a uniform distribution using the inverse of its CDF. Thus, we can construct the $X_i$ as:

$\displaystyle X_i = F^{-1}(U_i)$

Furthermore, the function $F^{-1}$ is monotonous. Thus, the median of the $X_i$ is the image of the median of $U_i$. Since the median of the $U_i$ has a Beta distribution, this means that the median of the $X_i$ has a deformed Beta distribution:

$\displaystyle \hat{m} = F^{-1} ( \beta )$

You might notice that we haven’t talked about Gaussians yet, but we’re almost there. Beta distributions become Gaussian with variance tending to 0 in the limit $\alpha \rightarrow \infty, \beta \rightarrow \infty$ while the ratio $\alpha / \beta$ stays constant. This is what happens here when $n \rightarrow \infty$. We have thus proved that in the uniform case, the median becomes Gaussian. Furthermore, because the variance of the Beta also goes to 0 as $n$ grows, the non-linear function $F^{-1}$ becomes close to linear and the median becomes close to Gaussian.

Note that this property holds in general for all empirical quantiles of IID datapoints. Would you be able to prove it? You just have to slightly modify the steps which I have presented here.

# Lessons from my first course (2)

Three months later, I’m finally coming back to this blog to finish gathering my thoughts on my first class. And since I took such a long break from my blog, the spring semester is now over, meaning that I can now also reflect on my second and third classes (which went better I think?)

In my preceding post, I tried to highlight what went right and wrong. Now, I’ll try to understand what I can do to make my next classes better by analyzing why things went the way they did.

One big issue during the first class was that I ended the semester being extremely tired. This happened once more during this spring semester. Part of it is understandable: these are my first courses, and I have a lot to learn and so preparing each class takes me a lot of time. Part of the solution is thus going to come from my summer resolution: finishing august with all my material ready for the classes of the autumn semester, and having the material for the spring classes ready by the end of january. Hopefully, I can then use only half a day each week preparing for each class. Hopefully, this will make me less stressed and less worn out when the end of the semester comes.

The second big issue was that I was blindsided by the fact that the students didn’t like the class. That’s pretty easy to fix, however. What I did during this semester was that I asked twice the student representatives to gather feedback from their peers so that I could get a better picture of how they felt. I’ll try to up this to three times for my next classes: once per month. I think this is very important since I feel that students are shy in expressing the problems they have with a class. They are willing to give feedback: I just need to ask it from them. Hopefully, this will be unbiased feedback. My biggest worry is that they won’t be honest with me. I’ll try to be careful here.

On top of this, I feel like I have learned a little about good teaching practices. I’ll take the time to reflect on that in another blog post. Armed with this new knowledge, I hope that my classes next year can be better!

# Lessons from my first course

Exam session for the autumn semester is over. This seems like a fine time to reflect over my first course.

This first course consisted of teaching second year bachelor students, in non-mathematical studies, an introduction to probability and statistics. The content of the class was essentially probability up to the central limit theorem and an introduction to the concept of statistics, with mostly Gaussian models. The hardest statistical topic was the Student t-test for linear regression.

Overall, I think I did a barely passing job for this course. It is of course understandable that not everything can go alright for a first course, but that doesn’t mean that I shouldn’t be honest with myself, my students and the rest of the department. Let’s try to list what went wrong and what went OK.

Things that went well:

• I was a very motivated teacher, and I brought much more energy to my class than most teachers
• I was very available to my students
• I remained (somewhat; you’ll see in a second) attentive to how the course was progressing, and I think that I improved my teaching quite a lot over the semester
• I spoke well
• I didn’t write as poorly as I expected of myself. I still need to work on this point quite a lot

Now, what didn’t go well:

• I was too ambitious for a first course. I redid everything from scratch, when I shouldn’t have. This caused me to commit many mistakes. Namely,
• I didn’t take into account enough what students need. They need a lot of structure: every piece of knowledge should clearly be labeled according to how important it is to understand, etc.
• I crafted a course that I would have liked as a student. This is a bad idea since I was always a very mathematically-minded student, and I was a very good student. My course should be aimed instead at a more practical-level, and towards a slower-pace.
• I completely neglected exercises. They are an integral part of the learning experience and are at least as important as the lectures. This caused students to resent the exercise sessions, and minimized how much they learnt from the class.
• Even though I identified some of these flaws along the way, most of them completely blindsinded me, and only came to surface during the anonymous review of the course by the students. This means I failed to gather meaningful feedback from the class. These issues should have been identified much earlier during the semester, which would have enabled me to correct them much earlier.
• I handled the pressure of teaching pretty poorly, especially at the end of the semester. I’m a very anxious guy, so it’s not surprising that I was stressed, but this went beyond stress. Teaching a semester is a bit like running a marathon: you can’t give all you have during the first half and finish the race crawling, you have to be regular. I need to pace my energy more in the future.

In the next few days, I’ll try to reflect on identifying why things went wrong (and right), and what I will do to make my future classes better.

# Nips workshops day 2

Here is what I learned on the second day of the nips workshops where I went to the deep bayesian networks workshops.

I feel like I should take a second to define precisely what the workshop is about. In a nutshell, it’s about trying to combine Bayesian methods and deep neural networks. On paper this seems like a great idea to augment Bayesian methods with the flexibility of neural networks. However, it is a path that really emphasizes the key weakness of Bayesian methods: the fact that it is riddled with computational problems.
The initial idea that people have been using to deal with the computational problems is “variational inference” (minimizing the “reverse” Kullback-Leibler divergence $KL(q,p)$)

The two highlights of the day were a first talk by Zoubin Ghahramani who gave a nice history lesson on Bayesian neural networks. We do tend to get a little bit caught up in what we are doing, and it’s great to have these talks from time to time to remember the giants whose shoulders we are standing on. The second highlight was a tribute to one of those giants, David Mackay who passed away earlier during the year, by Ryan Adams. I didn’t know Prof. Mackay, but this was a very moving talk and it painted a very vivid picture of him. He seemed like a great guy, and an even greater scientist. He will be missed.

An intriguing idea (which was presented several times during the whole conference) was entitled “Stein variational inference”. It consists in finding a cloud of points to approximate a target probability distribution according to an objective that is reminescent of $KL(q,p)$. I’m not sure how much this differs from using a sparse kernel-based approximation of the log probability of the target distribution. They also had a deep network method that was reminescent of generative adverserial networks.
This has a lot of interesting flavor with the combination of Stein’s method and variational inference and kernel-methods so I definitely need to look at it further

At this point, I was pretty saturated so I just couldn’t follow anymore, but the panel discussion which closed the workshop was pretty great. I’m guessing that these panels are growing on me after all. I really didn’t like them last year at nips, as well as the few that were in the cosyne workshops (I remember one at cosyne that was particularly unproductive). It really depends on the panel and the public’s comment… but it can be great. Overall, I liked the first day of the workshops more, but I’m guessing that, quite simply, that workshop simply aligned a bit more with my interests than today’s one. It was still great though.

# Nips workshops day 1

Here is what I learned during the first day of workshops: at the approximate bayesian inference workshop !

The day started by a very cool idea by Surya Ganguli, which showed how he could train a deep network to learn how to reverse the flow of time! It’s actually a bit less impressive than it sounds, but stil super cool. He was looking at how to model some data set in an unsupervised fashion. His idea was the following. You first design a dynamical process that will decay each data point into white noise (or whatever the appropriate equivalent of that is) and then training a deep-net to reverse the flow of time: take as input a sample from white-noise and return a prediction for the corresponding data point. He then gets machine that can approximately sample from the data distribution ! that was great.

I then gave a talk about expectation-propagation and how it can be viewed as a variant of gradient descent, shedding quite a bit of light on this method. I need to write a blog-post on that.

Closing the morning session was a pannel on the computational aspects of approximate inference methods. Pretty interesting. I really liked the fact that the panel was pretty open to questions from the audience. I like it a bit less when panels are very structured: in most cases, I don’t think this leads to very lively discussions.

During the lunch break, there was an awesome number of super cool posters (where were they during the poster sesssions of the main conference!?) which I’ll come back to in a future post if I have time.  People have so many super-cool ideas, and actually make them work! I love the energy-feeling you get from places like nips.

In the afternoon, we had a great statistical talk by Jeffrey Regier. They managed to run a model to infer the position, angle, color, etc of all stars and galaxies in the night sky (500TB). That. Was. Cool. His model had a large number of variables, interacting in interesting ways, and was trying to model a giant image of the sky. This was way cooler than my description of it.

Closing the day was a panel on the foundations and futures of variational inference. It was pretty interesting, but it was very directed. Richard Turner had a pretty great recap on the advantages of expectation propagation (oh yeah!), Philipp Hennig had a great introduction to probabilistic numerical methods. When probed about the combination of deep nets and bayesian inference, Ryan Adams said that he was hyped but that he also liked probabilistic graphical models and that he wants to combine so that both are playing to their strengths. The final panel made a great point about the fact that, for variational inference, we optimize the posterior without considering what we do afterwards with the approximate posterior / what task we solve. This is a great point, though I’m not sure how to find answers to that. Indeed, if my loss function is $(\theta - \theta_0)^8$, I’m sure that my approximation algorithm should take this into account, but how? Puzzling …