First, let me tell you that this book is really good. Gelman and Nolan are clearly excellent teachers, and they do a great job of sharing both the very small and the very big pieces of knowledge they have acquired over the years. The book mostly focuses on activities and examples that can be used to engage students. The objective is to make statistics more accessible, so that students retain more and get a more intuitive understanding of what statistics is about.

I’m guessing that, to a more experienced teacher, this book might feel light, as they would already have had some (most?) of the ideas presented. However, even if you have already developed these intuitions, having them laid out clearly by Gelman and Nolan would still be beneficial in my opinion.

The one issue I have with the book is that, because the authors provide such high-quality courses to their students, it seems a little hard for me to adapt their advice to my class, in which I do not have the resources for student projects, programming assignments, etc. I think it makes me a little bit jealous: I’d love to be able to do all of that! I’m going to need to think hard about how to still convey to my students some of the “softer” aspects of statistics: how to collect data in a “right” way, how to correctly summarize information in a graph, etc.

]]>

Today, I’ll discuss “Statistique pour mathématiciens, un premier cours rigoureux” (“Statistics for mathematicians: a first rigorous course”), a book covering a second-year Bachelor course aimed at mathematicians, used at EPFL. I’m reading it because I’m teaching the same course, but for engineers, next fall, and I’m hoping to find some good ideas about how to explain statistics.

As I write this post, I have read about half of the book (which is fairly short: 240 pages, including a 100-page appendix with proofs and exercise solutions). However, it is extremely dense and full of content. It reminds me of “All of Statistics” by Wasserman, in that this book could be a very good first dive into statistics for someone who feels comfortable with probability theory but has no knowledge of statistics. Of course, given that it is much shorter than “All of Statistics”, it doesn’t have the same depth or breadth, but it covers all the essential points of statistics.

One section which I particularly liked was the introduction. It seems to me that all of the major difficulties of statistics are conceptual: once you understand what you are trying to accomplish, and how you should frame your questions, the math part of statistics is (should be?) straightforward compared to other classes. However, these conceptual difficulties are usually either glossed over or very poorly presented. Panaretos does a great job of explaining exactly what statistics is and how it proceeds.

One particularly great example of this is when he describes how one should choose a model to describe a dataset: through a handful of clear and (at least for engineers) familiar examples, he shows how scientific, philosophical, or exploratory approaches can be used and combined to formulate an appropriate model.

I look forward to finishing this very nice book!

]]>

Let’s start by stating the theorem. Let $X_1, \dots, X_{2n+1}$ be IID random variables, and let $M$ be their empirical median (note that since we have an odd number of datapoints, the empirical median is straightforward to find: just order the $X_i$ and take the value at index $n+1$ in the ordered list). The theorem states that, in the limit of a large dataset ($n \to \infty$), the empirical median has a Gaussian distribution. Even better: we can know exactly what kind of random variable it is at finite $n$: it corresponds to a Beta distribution deformed by the inverse of the CDF $F$ of the $X_i$:

$$M \overset{d}{=} F^{-1}(B), \qquad B \sim \mathrm{Beta}(n+1, n+1).$$

In order to prove this result, let’s first focus on the case in which the $X_i$ are uniform random variables on $[0, 1]$. Let’s compute the probability $P(M \in [x, x+dx])$. We have $2n+1$ possibilities for the index of the median. Thus, we can focus on the case in which $X_{2n+1}$ is the median:

$$P(M \in [x, x+dx]) = (2n+1)\, P\big(X_{2n+1} \in [x, x+dx] \text{ and } X_{2n+1} \text{ is the median}\big).$$

We then need to worry about the other indexes: $n$ points need to be above $x$ and $n$ need to be below. This gives $\binom{2n}{n}$ possible repartitions. We can once more focus on a single possibility: $X_1, \dots, X_n$ are below and $X_{n+1}, \dots, X_{2n}$ are above.

Finally, all the $X_i$ are IID. The probability above is thus straightforward to compute:

$$P(M \in [x, x+dx]) = (2n+1) \binom{2n}{n}\, x^n (1-x)^n \, dx,$$

which we recognize as a Beta distribution with parameters $(n+1, n+1)$.

Ok. Now we know what happens for uniform distributions. How can we extend that result to the general case? In order to do so, we have to remember that any random variable can be constructed from a uniform random variable using the inverse of its CDF $F$. Thus, we can construct the $X_i$ as:

$$X_i = F^{-1}(U_i), \qquad U_i \sim \mathrm{Uniform}(0, 1).$$

Furthermore, the function $F^{-1}$ is monotone. Thus, the median of the $X_i$ is the image under $F^{-1}$ of the median of the $U_i$. Since the median of the $U_i$ has a Beta distribution, this means that the median of the $X_i$ has a deformed Beta distribution:

$$M = F^{-1}(B), \qquad B \sim \mathrm{Beta}(n+1, n+1).$$

You might notice that we haven’t talked about Gaussians yet, but we’re almost there. Beta distributions become Gaussian, with variance tending to 0, in the limit where both parameters go to infinity while their ratio stays constant. This is what happens here when $n \to \infty$. We have thus proved that, in the uniform case, the median becomes Gaussian. Furthermore, because the variance of the Beta also goes to 0 as $n$ grows, the non-linear function $F^{-1}$ becomes close to linear on the relevant scale, and the median of the $X_i$ also becomes close to Gaussian.
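The finite-$n$ result is easy to check numerically. Here is a quick Monte Carlo sketch (the sample sizes, tolerances, and helper name are my own arbitrary choices):

```python
import random
import statistics

def empirical_median(n, sampler):
    # Draw 2n+1 IID points and return the empirical median:
    # the value at index n (0-based) of the sorted sample.
    xs = sorted(sampler() for _ in range(2 * n + 1))
    return xs[n]

# For uniform samples, the median should follow a Beta(n+1, n+1)
# distribution: mean 1/2, variance 1 / (4 * (2n + 3)).
random.seed(0)
n = 20
medians = [empirical_median(n, random.random) for _ in range(20000)]
print(statistics.fmean(medians))     # should be close to 0.5
print(statistics.variance(medians))  # should be close to 1/172 ≈ 0.0058
```

Replacing `random.random` with any other sampler lets you check the general, deformed-Beta case as well.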

Note that this property holds in general for all empirical quantiles of IID datapoints. Would you be able to prove it? You just have to slightly modify the steps I have presented here.

]]>

In my preceding post, I tried to highlight what went right and wrong. Now, I’ll try to understand what I can do to make my next classes better by analyzing why things went the way they did.

One big issue during the first class was that I ended the semester extremely tired. This happened once more during this spring semester. Part of it is understandable: these are my first courses, I have a lot to learn, and preparing each class takes me a lot of time. Part of the solution is thus going to come from my summer resolution: finishing August with all my material ready for the classes of the autumn semester, and having the material for the spring classes ready by the end of January. Hopefully, I can then spend only half a day each week preparing for each class, which should leave me less stressed and less worn out when the end of the semester comes.

The second big issue was that I was blindsided by the fact that the students didn’t like the class. That’s pretty easy to fix, however. During this semester, I twice asked the student representatives to gather feedback from their peers, so that I could get a better picture of how they felt. I’ll try to up this to three times for my next classes: once per month. I think this is very important, since students are shy about expressing the problems they have with a class. They are willing to give feedback: I just need to ask them for it. Hopefully, this feedback will be unbiased; my biggest worry is that they won’t be honest with me. I’ll try to be careful here.

On top of this, I feel like I have learned a little about good teaching practices. I’ll take the time to reflect on that in another blog post. Armed with this new knowledge, I hope that my classes next year can be better!

]]>

This first course consisted of teaching an introduction to probability and statistics to second-year bachelor students in non-mathematical programs. The content of the class was essentially probability up to the central limit theorem, plus an introduction to statistics, mostly with Gaussian models. The hardest statistical topic was the Student t-test for linear regression.

Overall, I think I did a barely passing job with this course. It is of course understandable that not everything goes right in a first course, but that doesn’t mean I shouldn’t be honest with myself, my students, and the rest of the department. Let’s try to list what went wrong and what went OK.

Things that went well:

- I was a very motivated teacher, and I brought much more energy to my class than most teachers
- I was very available to my students
- I remained (somewhat; you’ll see in a second) attentive to how the course was progressing, and I think that I improved my teaching quite a lot over the semester
- I spoke well
- I didn’t write as poorly as I had expected of myself, though I still need to work on this point quite a lot

Now, what didn’t go well:

- I was too ambitious for a first course. I redid everything from scratch, when I shouldn’t have. This caused me to make many mistakes. Namely:
- I didn’t take students’ needs into account enough. They need a lot of structure: every piece of knowledge should be clearly labeled according to how important it is to understand, etc.
- I crafted a course that I would have liked as a student. This is a bad idea, since I was always a very mathematically-minded student, and a very good one. My course should instead be aimed at a more practical level and a slower pace.
- I completely neglected exercises. They are an integral part of the learning experience and are **at least as important** as the lectures. This caused students to resent the exercise sessions, and minimized how much they learnt from the class.
- Even though I identified some of these flaws along the way, most of them completely blindsided me, and only came to the surface during the anonymous review of the course by the students. This means I failed to gather meaningful feedback from the class. These issues should have been identified much earlier in the semester, which would have enabled me to correct them sooner.
- I handled the pressure of teaching pretty poorly, especially at the end of the semester. I’m a very anxious guy, so it’s not surprising that I was stressed, but this went beyond stress. Teaching for a semester is a bit like running a marathon: you can’t give everything you have during the first half and finish the race crawling; you have to be steady. I need to pace my energy better in the future.

In the next few days, I’ll try to reflect on why things went wrong (and right), and on what I will do to make my future classes better.

]]>

I feel like I should take a second to define precisely what the workshop is about. In a nutshell, it’s about trying to combine Bayesian methods and deep neural networks. On paper, this seems like a great way to augment Bayesian methods with the flexibility of neural networks. However, it is a path that really emphasizes the key weakness of Bayesian methods: the fact that they are riddled with computational problems.

The initial idea that people have been using to deal with these computational problems is “variational inference”: minimizing the “reverse” Kullback-Leibler divergence $KL(q \| p)$ between an approximation $q$ and the target distribution $p$, inside some tractable family of distributions.

The first highlight of the day was a talk by Zoubin Ghahramani, who gave a nice history lesson on Bayesian neural networks. We do tend to get a little bit caught up in what we are doing, and it’s great to have these talks from time to time, to remember the giants whose shoulders we are standing on. The second highlight was a tribute by Ryan Adams to one of those giants, David MacKay, who passed away earlier this year. I didn’t know Prof. MacKay, but this was a very moving talk, and it painted a very vivid picture of him. He seemed like a great guy, and an even greater scientist. He will be missed.

An intriguing idea (which was presented several times during the conference) was “Stein variational inference”. It consists in finding a cloud of points that approximates a target probability distribution, according to an objective reminiscent of the KL divergence. I’m not sure how much this differs from using a sparse kernel-based approximation of the log-probability of the target distribution. They also had a deep-network method that was reminiscent of generative adversarial networks.

This has a lot of interesting flavor, with its combination of Stein’s method, variational inference, and kernel methods, so I definitely need to look at it further.

At this point, I was pretty saturated, so I just couldn’t follow anymore, but the panel discussion which closed the workshop was pretty great. I’m guessing that these panels are growing on me after all. I really didn’t like them last year at NIPS, nor the few that were in the COSYNE workshops (I remember one at COSYNE that was particularly unproductive). It really depends on the panel and the public’s comments… but it can be great. Overall, I liked the first day of the workshops more, but I’m guessing that workshop simply aligned a bit more with my interests than today’s. It was still great though.

]]>

The day started with a very cool idea by Surya Ganguli, who showed how he could train a deep network to learn how to reverse the flow of time! It’s actually a bit less impressive than it sounds, but still super cool. He was looking at how to model a dataset in an unsupervised fashion. His idea was the following: you first design a dynamical process that decays each data point into white noise (or whatever the appropriate equivalent is), and you then train a deep net to reverse the flow of time: take as input a sample of white noise and return a prediction for the corresponding data point. He then gets a machine that can approximately sample from the data distribution! That was great.
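In case the idea sounds abstract, here is a toy sketch of just the forward, data-destroying half of the scheme for a single scalar datapoint; the step count and noise schedule are made-up values of mine, and the actual work trains a deep network to invert this chain step by step:

```python
import random

def forward_noise(x0, steps=50, beta=0.1, rng=None):
    # Each step shrinks the signal and injects Gaussian noise, so that
    # after enough steps the datapoint is essentially white noise.
    rng = rng or random.Random(0)
    x = x0
    for _ in range(steps):
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * rng.gauss(0, 1)
    return x

# After 50 steps, only a factor 0.9^25 ≈ 0.07 of the signal survives,
# and the variance of the result is close to 1 (pure noise):
rng = random.Random(0)
samples = [forward_noise(10.0, rng=rng) for _ in range(5000)]
```

The scaling $\sqrt{1-\beta}$ is chosen so that the stationary variance of the process is exactly 1, which keeps the "white noise" end-point well defined.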

I then gave a talk about expectation propagation and how it can be viewed as a variant of gradient descent, which sheds quite a bit of light on this method. I need to write a blog post about that.

Closing the morning session was a panel on the computational aspects of approximate inference methods. Pretty interesting. I really liked the fact that the panel was pretty open to questions from the audience. I like it less when panels are very structured: in most cases, I don’t think that leads to very lively discussions.

During the lunch break, there was an awesome number of super cool posters (where were they during the poster sessions of the main conference!?), which I’ll come back to in a future post if I have time. People have so many super cool ideas, and they actually make them work! I love the energy you get from places like NIPS.

In the afternoon, we had a great statistical talk by Jeffrey Regier. They managed to run a model that infers the position, angle, color, etc. of all the stars and galaxies in the night sky (500TB of data). That. Was. Cool. His model had a large number of variables, interacting in interesting ways, and was used to model a giant image of the sky. It was way cooler than my description of it.

Closing the day was a panel on the foundations and future of variational inference. It was pretty interesting, but it was very directed. Richard Turner had a pretty great recap of the advantages of *expectation propagation* (oh yeah!), and Philipp Hennig had a great introduction to probabilistic numerical methods. When probed about the combination of deep nets and Bayesian inference, Ryan Adams said that he was hyped, but that he also likes probabilistic graphical models, and that he wants to combine the two so that both play to their strengths. The final panel made a great point about the fact that, in variational inference, we optimize the approximate posterior without considering what we do with it afterwards / what task we solve. This is a great point, though I’m not sure how to find answers to it. Indeed, if my loss function is some known $L$, I’m sure that my approximation algorithm should take it into account, but how? Puzzling…

]]>

The day started with a very, very cool talk by Kyle Cranmer from CERN. He did a great job explaining how physicists performed the statistical analysis behind the discovery of the Higgs boson. His talk went into great detail but remained extremely easy to understand, and was absolutely great (probably my favorite of the whole conference). He also had a few great proposals for interesting problems that the community might be interested in modeling.

In the afternoon, there was a very entertaining talk by Marc Raibert from Boston Dynamics. They make the wonderful robots that you have probably seen on YouTube. I was a little bit sad that he didn’t give us much insight into how they actually make the robots move, but he did show a lot, and had lots of entertaining videos, so he gets a shout-out.

Another cool idea in the afternoon, much more focused on practical theory, was presented by Damien Scieur, Alexandre d’Aspremont and Francis Bach. They showed a post-processing method for accelerating optimization algorithms: a very small and cheap trick that you can plug into any optimization method to make it better. Overall, it was pretty cool, but I’m not sure I understand optimization well enough to give the best commentary on it.

]]>

In the morning, Matthias Seeger (Amazon) presented the model they use to forecast demand. That was very interesting. It was basically a linear regression with handcrafted features (he didn’t go into detail about how they were crafted). The regression itself was also interesting: they want to regress the number of daily sales on the predictors. They do so with two logistic regressions, determining whether a product gets 1 sale or more and 2 sales or more, and then a Poisson likelihood for all sale counts higher than that. This makes the system able to model the fact that the first few values are much more frequent than they would be under a Poisson model, while modeling the tail behavior in a simple way.

Another interesting bit: they have periods where products aren’t in stock, so they can’t be sold. If they ignore these periods, their model goes wrong (it under-estimates the desirability of the product). If they model them correctly, as periods with no sales because the product is out of stock, then their model is way better. As always, the generative model should match the physical realities of the data.

In the afternoon, there was also a great talk by Saket Navlakha from the Salk Institute. I’m normally not a great fan of bio-mimetic approaches (trying to use biology to improve machine learning techniques or, more generally, engineering technology; in most cases, the constraints we deal with as engineers are extremely different from the constraints biology is dealing with), but he had two great examples of exactly that. The first one consisted of creating a communication network by removing edges from a fully-connected (or just very dense) network.

The final great thing was a very particular mixture of Bayesian methods and deep learning by Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet and Ole Winther. They wanted to model sequential data (recordings of speech). A good model here uses a latent Markov chain, conditional on which the sound at each time step is independent (a hidden Markov model). Normally, people would use a very simple latent model: a Kalman filter. This doesn’t work well at all, as its dynamics are much too simple compared to the actual dynamics of the signal being modelled. What Marco and his co-authors propose instead is to use a deep network to represent the dynamics of the latent space. They are then able to perform inference on the weights of their deep network, and they finally learn an *inference network* that takes a soundwave as input and returns a variational approximation of the posterior on the latent space! This sounds insane, but it actually works, and works very well.
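To make the shape of the model concrete, here is a tiny caricature of the generative process, with a one-parameter "network" standing in for the transition; every name and value here is illustrative and has nothing to do with the authors' actual architecture:

```python
import math
import random

def sample_sequence(T, w=1.3, b=0.1, obs_scale=0.5, seed=0):
    # A latent chain z_t whose transition is a (tiny) nonlinear network
    # instead of the linear dynamics of a Kalman filter, plus a simple
    # linear-Gaussian emission x_t. The real model uses deep networks
    # for the dynamics and learns everything from data.
    rng = random.Random(seed)
    z, xs = 0.0, []
    for _ in range(T):
        z = math.tanh(w * z + b) + rng.gauss(0, 0.1)  # nonlinear dynamics
        xs.append(obs_scale * z + rng.gauss(0, 0.1))  # emission
    return xs

xs = sample_sequence(100)
```

Swapping the `tanh` line for `z = a * z + noise` recovers the linear dynamics of a Kalman filter, which is exactly the restriction the paper removes.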

I love these extremely creative combinations of deep learning with Bayesian inference: there was a lot of this at NIPS this year, and I’m very excited to see where it goes in the future.

]]>

Day 1 action report.

Day 1 was dedicated to tutorials: long talks with the objective of introducing the audience to important topics of machine learning.

I first attended a tutorial on variational inference by David Blei, Shakir Mohamed and Rajesh Ranganath. It was really cool. They presented variational inference (i.e., minimizing the reverse KL divergence to the target distribution inside some restricted class of distributions) in a way that made everything they do very clear. They first presented an overview of the Bayesian approach, then showed how variational inference can be used in almost all situations, and finally showed some cool applications mixing Bayesian inference and deep nets. I don’t think their slides are online yet, sadly. However, David Blei and co-authors have a review on the subject which seems worth checking out: https://arxiv.org/abs/1601.00670
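As a toy illustration of what "minimizing the reverse KL" means, here is reverse-KL minimization by gradient descent in the simplest possible setting: a 1-D Gaussian $q = N(m, s^2)$ approximating a standard normal target. This is my own example, not from the tutorial, and the step size and initialization are arbitrary:

```python
import math

def reverse_kl(m, s):
    # Closed-form KL( N(m, s^2) || N(0, 1) ), the "reverse" KL of VI.
    return -math.log(s) + (s * s + m * m) / 2 - 0.5

def fit(steps=500, lr=0.1):
    # Gradient descent on (m, log s); since the variational family
    # contains the target here, the optimum is exactly m = 0, s = 1.
    m, log_s = 2.0, math.log(3.0)
    for _ in range(steps):
        s = math.exp(log_s)
        m -= lr * m                  # dKL/dm = m
        log_s -= lr * (s * s - 1.0)  # dKL/d(log s) = s^2 - 1
    return m, math.exp(log_s)

m, s = fit()
print(m, s, reverse_kl(m, s))  # m ≈ 0, s ≈ 1, KL ≈ 0
```

In real problems, the family does not contain the target and the KL cannot be written in closed form, which is where the stochastic-gradient machinery from the tutorial comes in.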

The second exciting thing I got to learn about concerned gradient methods. I know very little about those, so I went to listen to Francis Bach and Suvrit Sra to learn more. Their talks focused on showing the prowess of SVRG (“stochastic variance-reduced gradient”, if I’m not mistaken) compared to its alternatives. The key idea is:

- we do stochastic gradient descent
- but we keep the gradients which we have computed in memory and, even though they are “stale” (they do not correspond to the current point), we follow the sum of all the gradients we have in memory

This approach, surprisingly, works super well and just wipes the floor with both plain SGD and deterministic gradient methods. I’ll be sure to read the corresponding papers in detail (some samples: https://arxiv.org/abs/1202.6258 , https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf ). Slides are available at: http://www.di.ens.fr/~fbach/fbach_tutorial_vr_nips_2016.pdf and http://www.di.ens.fr/~fbach/ssra_tutorial_vr_nips_2016.pdf .
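To make the bullet points concrete, here is a minimal sketch of the stale-gradient idea on a toy least-squares problem. Note that, as written, this is closest to the SAG variant of the idea (the first paper linked above); the step size, data, and function name are my own choices:

```python
import random

def stale_gradient_descent(xs, ys, lr=0.002, epochs=2000, seed=0):
    # Fit y ≈ w * x on a toy dataset. One stored (possibly stale)
    # gradient is kept per datapoint; each iteration refreshes one of
    # them and steps along the average of everything in memory.
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    stored = [0.0] * n   # stale per-point gradients
    grad_sum = 0.0       # running sum of the stored gradients
    for _ in range(epochs * n):
        i = rng.randrange(n)
        g = 2 * (w * xs[i] - ys[i]) * xs[i]  # fresh gradient for point i
        grad_sum += g - stored[i]            # swap out the stale entry
        stored[i] = g
        w -= lr * grad_sum / n               # conservatively small step

    return w

# Data with an exact least-squares solution at w = 2:
w = stale_gradient_descent([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

The trick is that each iteration costs one gradient evaluation, like SGD, yet the direction followed approaches the full-batch gradient as the stored entries get refreshed, which is where the variance reduction comes from.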

I’m sure the rest of the conference will be at least as good!

]]>