First of all, let’s understand what we are talking about. We have some random variable $X$ which we are trying to model with various models $M_1, M_2, \ldots$. These models are parametric probabilistic models: each specifies a family of densities $p(x \mid \theta_k)$ (note that, critically, the dimension of the parameter $\theta_k$ varies between models). AIC and BIC both deal with ideas of how to choose an appropriate model $M_k$. For a simple example, consider modeling $X$ as a Gaussian ($M_1$), or a mixture of two Gaussians ($M_2$), or of three ($M_3$), etc.

If we had access to the exact probability distribution of $X$, one thing we could do to compare these various models is the following:

- First, find the value $\hat\theta_k$ of the parameter such that the distribution $p(x \mid \hat\theta_k)$ is the closest to the truth.
- Second, report the distance between the truth and this best approximation in model $M_k$.
- Third, use these distances to rank the various models.

One sensible notion of distance we could use is the KL divergence:

$$KL\big(p_{\text{true}} \,\|\, p(\cdot \mid \theta_k)\big) = \int p_{\text{true}}(x) \log \frac{p_{\text{true}}(x)}{p(x \mid \theta_k)} \, dx$$

which has the benefit of having nice computational properties.
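To make this concrete, here is a tiny numpy sketch of the KL divergence for discrete distributions (the probability vectors below are made up purely for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_true = [0.5, 0.3, 0.2]   # made-up "true" distribution
p_model = [0.4, 0.4, 0.2]  # made-up model distribution
print(kl_divergence(p_true, p_model))  # small positive number
print(kl_divergence(p_true, p_true))   # 0.0: the divergence vanishes iff p = q
```

Note that the divergence is asymmetric: $KL(p \| q) \neq KL(q \| p)$ in general, which is fine here since we always measure divergence *from* the truth.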

In practice, we won’t be able to compute this KL divergence because we do not have access to the true probability distribution of $X$. What we most often have access to is a set of IID samples $x_1, \ldots, x_n$ drawn from $X$. What we can then do is use these samples to construct an unbiased estimate of the KL divergence between the truth and the best probability distribution inside model $M_k$. This is precisely what the AIC offers.

More precisely, we do not estimate the KL divergence itself. We estimate the expected log-likelihood $\mathbb{E}_{p_{\text{true}}}[\log p(X \mid \hat\theta_k)]$ of the best probability distribution inside model $M_k$. This quantity is equal to the negative KL divergence up to one unknown constant, common to all models: the entropy of the true distribution. Thus, a ranking of models based on expected log-likelihoods has the same order as one based on KL divergences. However, we can’t simply use the in-sample log-likelihood at the MLE, because that value is biased upwards. What Akaike did was compute this asymptotic bias; the AIC reports a corrected value which removes it:

$$\mathrm{AIC} = -2 \log L(\hat\theta_k) + 2k$$

where $k$ is the dimensionality of the parameter. Note that other, more advanced criteria also exist. The only one worth mentioning is the AICc, which offers a slightly improved correction for small datasets by also taking into account the number of datapoints $n$: $\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}$.

Thus, AIC is about correctly estimating the quantity we should care about when choosing an appropriate model; it doesn’t deal directly with choosing the model itself. However, we can use the unbiased estimates to then select a model which gives a good account of the data. AIC-style corrections can also be applied to estimates obtained by minimizing other criteria inside a parametric class, unrelated to likelihoods.
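To make the correction concrete, here is a small Python sketch of the standard formulas $\mathrm{AIC} = 2k - 2\log L(\hat\theta)$ and $\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}$. The log-likelihood values below are invented purely for illustration:

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion (lower is better)."""
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood, k, n):
    """Small-sample correction of the AIC (requires n > k + 1)."""
    return aic(log_likelihood, k) + 2 * k * (k + 1) / (n - k - 1)

# Hypothetical fits of 1-, 2- and 3-component Gaussian mixtures to n datapoints.
# A one-dimensional mixture of m Gaussians has k = 3m - 1 free parameters
# (m means, m variances, m - 1 weights).
n = 1000
models = {"1 Gaussian": (-1420.3, 2),
          "2 Gaussians": (-1380.1, 5),
          "3 Gaussians": (-1378.9, 8)}
for name, (ll, k) in models.items():
    print(f"{name}: AIC = {aic(ll, k):.1f}, AICc = {aicc(ll, k, n):.1f}")
```

With these made-up numbers, the two-component mixture wins: going to three components barely improves the log-likelihood, which does not compensate for the extra-parameter penalty.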

BIC deals with a slightly different problem. Assume that we have a very large (or maybe infinite) number of nested models: model $M_{k+1}$ is a more complex version of model $M_k$. For example, consider performing linear regression while expanding the set of predictor variables, or the example I already gave of a mixture of Gaussians. In general, the true probability distribution won’t fall inside our class of models. Thus, model $M_{k+1}$ will always be a better model than model $M_k$, because its increased flexibility will allow it to capture the complexity of the data better. In such situations, the BIC is inappropriate to use.

However, in some extremely rare examples, it might happen that the true probability distribution is actually inside model $M_{k^*}$ (and thus also inside all further models). We could then try to recover this $k^*$. For example, in linear regression, the true model might be a quadratic polynomial. Fitting a third-degree polynomial then just provides extra degrees of freedom which are not needed. We might try to learn from the data that a second-degree polynomial is sufficient, as that would be informative.

BIC focuses on this task of consistently estimating $k^*$. Like AIC, BIC is a correction to the log-likelihood that depends on the number of datapoints and the dimensionality of the model:

$$\mathrm{BIC} = -2 \log L(\hat\theta_k) + k \log n$$

However, BIC doesn’t aim to correct the bias that is present in that quantity. Its aim is that we can recover $k^*$ by finding the model with minimum BIC. This gives us a consistent estimator of $k^*$: when the number of datapoints is large enough, we recover the correct value with probability tending to 1.

However, this makes BIC extremely restricted: we can only use it if we assume that we have somehow captured the truth inside one of our models $M_k$. This is a very bold assumption, and it is 100% wrong, unless you have generated the data yourself. Thankfully, BIC can also be applied to a slightly more realistic case: the case in which the model chain is such that, after a certain index $k^*$, the models stop improving: model $M_{k^*+1}$ is exactly as good as model $M_{k^*}$. This can only happen if, for some reason, the extra flexibility is not needed, even though model $M_{k^*}$ is not the true model. That could happen. For example, let us return to the regression example: imagine that the true regression function is indeed quadratic, but the noise model you are using is incorrect. Then, none of the models beyond quadratic give an improvement. Let us refer to this $k^*$ as the index of the **quasi-true** model. Thankfully for its use, BIC also correctly recovers a quasi-true model (in that it gives us a consistent estimator of its index).
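As a sketch of how this plays out, here is a small simulation (my own toy setup, not taken from any particular reference) where the truth is a quadratic polynomial and we select the degree by minimizing $\mathrm{BIC} = k \log n - 2 \log L(\hat\theta)$:

```python
import numpy as np

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion (lower is better)."""
    return k * np.log(n) - 2 * log_likelihood

# Toy setup: the truth is a quadratic polynomial with Gaussian noise.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.5, n)

scores = {}
for degree in range(1, 6):
    resid = y - np.polyval(np.polyfit(x, y, degree), x)
    sigma2 = np.mean(resid**2)                          # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian log-likelihood
    scores[degree] = bic(loglik, degree + 2, n)  # k = coefficients + noise variance
print(min(scores, key=scores.get))  # typically selects degree 2, the true model
```

The degree-1 model is heavily penalized by its misfit, while degrees 3 and above pay the $\log n$ penalty for parameters that buy essentially no likelihood improvement.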

I won’t go into details but, as the name indicates, BIC is an approximation to a Bayesian idea. More precisely, BIC is a very rough approximation of the log-posterior distribution over models when the prior is uniform over all models AND the prior over the parameters inside each model is also flat. Honestly, as an alternative to BIC, I would thus use a more Bayesian method with:

- a realistic prior on the models which ranks more complicated models as less likely;
- a realistic prior on the parameter space;
- better approximations than the horrible ones that are used in BIC.

This would have the exact same guarantees as BIC (the asymptotic behavior would be identical) while being more principled (at least, appearing more principled to me) before the asymptote kicks in.

Thus, it turns out that AIC and BIC are actually slightly different beasts. AIC is all about estimating a “fitting score” for each model in an unbiased fashion. It is thus extremely general: we can use the unbiased scores to decide between the various models at hand, construct confidence intervals, etc. In contrast, BIC can only be used if, for some reason, we suspect that we are in a situation in which one model is true or quasi-true. Then, we can use BIC to recover its index $k^*$. This makes BIC way less useful.

---

What annoys me is the fact that so many publications are either low-quality (the same work could be presented in a much clearer fashion) or low-effort (the work represents a marginal improvement over the existing state of the art). A small note: I’m more than fine with incremental work: it is an essential stepping stone in science. Most everything we do is definitely **not** a breakthrough. However, what is extremely annoying is when the authors aren’t straightforward about how their work is incremental. Some results are presented as if the authors are offering a revolutionary approach, even though it’s just the same old crap that they are re-hashing for the third time.

These two flaws make reading articles extremely unenjoyable and much harder than it has to be. When I’m reading, I want to absorb new knowledge. I really don’t want to fight against the authors to decode whatever they meant, and I really, really don’t want to have to remain hyper-attentive to decipher which parts are new and which are old material that I already know (and that the authors are probably butchering in their attempts at obfuscation).

I don’t know where these flaws come from or how to fix them. I’m guessing that part of the problem is that researchers are under so much pressure to produce new articles in order to secure funding/positions/etc. As a result, they need to cut corners, which explains the rushed articles, and they try to make their contributions sound more impressive than they are (which helps their articles get accepted).

What I can do is strive to ensure that my contributions don’t have these flaws (oh, the arrogance of youth). I’ll try as much as possible to make my contributions as clear as I can (and I’ll take the time to ensure that this happens: I won’t rush to get something out if it isn’t ready). And, when I do some incremental work, I’ll make sure that I properly document exactly how it is positioned relative to the literature, AND I’ll use such occasions to try to clarify the existing literature. I’ll do so by treating the corresponding article as a tutorial, with the objective that readers who aren’t familiar with the field won’t need to refer to other works to understand the state of the art.

Hopefully, I can follow through on this ideal.

---

First, let me present the general “supervised learning” setting. We are given some number $n$ of pairs of examples $(x_i, y_i)$. What we want to accomplish is to learn a function $f$ such that $f(x_i) \approx y_i$. In general, we focus on trying to learn a linear function $f(x) = \beta^T x$, but more general forms for the function are also possible. The usual approach is to write down a probabilistic model of the $y_i$ conditional on the $x_i$ and $\beta$, and to maximize the likelihood to find the best value for $\beta$. If we want to be fancy, we can also add a regularizing penalty, such as the $L_2$ one: $\lambda \lVert \beta \rVert_2^2$.

This is all straightforward statistics. However, there is one key step that many sources forget to mention, and it is **critical**. Quite simply, we need to do a little bit of preprocessing on all the components of our predictors $x_i$. This preprocessing needs to ensure that the various components are all (approximately) centered and scaled.

This can be justified in a variety of ways, but the one that makes the most sense to me is that we should try to make our methods as invariant to irrelevant details of the input as they can be, unless we have a very good reason not to. In this case, it is easy to imagine situations in which the predictors get shifted or rescaled for some reason. For example, choosing different units for a measurement changes the scale of the predictors. It’s rarer for the center of the distribution to change, but that can sometimes happen too. None of these modifications should change the result of our inference. Thus, our methods should have a step which removes these extra degrees of freedom and ensures that our inference is **invariant**.

Furthermore, consider that what we are trying to do is gain information from the components of $x$. When the value of a component $x_j$ is close to the center of its distribution, that is a normal value which should provide us with no information: it shouldn’t change our evaluation of $y$. This intuition can only be encoded if the center of $x_j$ is 0. Similarly, in order to know how relevant it is that $x_j$ differs from its center, we need to know the scale at which $x_j$ varies. If the value we are considering is close to 0 **at the relevant scale for $x_j$**, then again it should have a low impact on the value we predict for $y$. Centering and scaling the predictors thus ensures that we treat the information we gain from all of them equally.

Now comes the thorny question: how exactly should we center and scale the predictors? Indeed, there are infinitely many notions of the center and scale of a random variable. Should we center with the (empirical) mean of the $x_j$? Or should we prefer the median? Should we scale using the square root of the variance? Or the median absolute deviation: $\mathrm{med}\,\lvert x_j - \mathrm{med}(x_j) \rvert$? Or the inter-quartile spacing? I do not know the appropriate answer to these questions (and honestly, I’m not even sure there is a single appropriate answer). My instinct is to use a robust measure of the width (key instinct: always be robust), so the median absolute deviation sounds like a fine choice to me, but the standard deviation is probably fine too.
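Both options can be sketched in a few lines of numpy (the 1.4826 factor is the usual constant that makes the median absolute deviation consistent with the standard deviation for Gaussian data):

```python
import numpy as np

def standardize(X, robust=False):
    """Center and scale each column (predictor) of the design matrix X.

    robust=False: subtract the mean, divide by the standard deviation.
    robust=True:  subtract the median, divide by the median absolute
                  deviation (times 1.4826 for Gaussian consistency).
    """
    X = np.asarray(X, dtype=float)
    if robust:
        center = np.median(X, axis=0)
        scale = 1.4826 * np.median(np.abs(X - center), axis=0)
    else:
        center = X.mean(axis=0)
        scale = X.std(axis=0)
    return (X - center) / scale

rng = np.random.default_rng(1)
X = rng.normal(10.0, 3.0, size=(500, 4))  # predictors in arbitrary units
Z = standardize(X)                        # each column now has mean 0, sd 1
Zr = standardize(X, robust=True)          # each column now has median 0
```

The robust version pays off as soon as a few outliers creep into the predictors, since a single extreme value can inflate the standard deviation arbitrarily.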

As a final note, let me talk about one very cool method that people have been using in deep learning to make gradient descent work better. This method is called batch normalization. During batch training, we do not compute the output of the deep network as usual. Instead, we compute the activity of each layer one by one and, before feeding its output to the next layer, we make sure that the output of each unit in the layer is **centered** and **scaled** (using the empirical mean and variance inside the training batch under consideration). This little trick really improves the speed at which the network learns (I’m not sure anybody has a good intuition as to why; I certainly don’t).
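The core normalization step can be sketched in numpy as follows (real batch-norm layers also learn a per-unit scale and shift, and keep running statistics for test time, which I omit here):

```python
import numpy as np

def batch_norm(activations, eps=1e-5):
    """Center and scale the output of each unit (column) over the batch
    (rows); eps guards against division by zero for constant units."""
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    return (activations - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
layer_output = rng.normal(5.0, 2.0, size=(32, 16))  # batch of 32, 16 units
normalized = batch_norm(layer_output)               # each unit: mean ~0, sd ~1
```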

As a conclusion, please always remember to **center and scale** your predictors when doing supervised learning!

---

First, let us set the stage. We want to infer the value of some $d$-dimensional parameter: $\mu \in \mathbb{R}^d$. In order to do so, we are given a single observation $X$ which corresponds to $\mu$ corrupted by Gaussian noise with covariance the identity matrix:

$$X \sim \mathcal{N}(\mu, I_d)$$

How should we estimate $\mu$ from $X$?

A natural idea consists in using $X$ itself. This is the maximum-likelihood estimator of $\mu$, and it is indeed a good estimator of $\mu$: it is the best estimator that is translation-invariant; it is minimax; etc.

However, Stein has shown that $X$ is not actually perfect: there exists a whole family of estimators which are better than it, if $d \geq 3$. The classic example is the James–Stein estimator:

$$\hat\mu_{JS} = \left(1 - \frac{d-2}{\lVert X \rVert^2}\right) X$$

which shrinks $X$ towards $0$ (variants of it shrink towards any fixed point $\mu_0$).

No matter the true value of $\mu$, these estimators always have lower Mean-Squared Error than $X$. In other words, they always do a better job! In a sense, it is thus slightly “stupid” to use $X$ instead of them, because you are going to make bigger errors by doing so.

The reason this occurs is the strong power of biasing your estimators in high dimensions. Biasing increases the bias term of the error, but causes a stronger reduction in the variance term, and is thus beneficial overall. The Stein estimator creates a bias towards $0$ which improves the performance of the estimation. Other biased estimators, such as those adding an $L_1$ or $L_2$ penalty (in machine learning’s horrible jargon, these are known as Lasso and Ridge penalties), would also improve over $X$ when $\mu$ is close enough to $0$; however, the Stein estimator is better than them far from $0$, since its bias vanishes when $X$ is far from $0$.
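The domination is easy to see in a quick Monte-Carlo sketch (my own toy setup: shrinking towards $0$ with the true $\mu$ at $0$, where the gain is largest; the gain shrinks, but never reverses, as $\mu$ moves away):

```python
import numpy as np

rng = np.random.default_rng(3)
d, trials = 10, 20000
mu = np.zeros(d)  # true parameter; shrinking towards 0 helps most here

X = rng.normal(mu, 1.0, size=(trials, d))      # one noisy observation per trial
norm2 = np.sum(X**2, axis=1, keepdims=True)
js = (1.0 - (d - 2) / norm2) * X               # James–Stein estimator

mse_mle = np.mean(np.sum((X - mu) ** 2, axis=1))   # close to d = 10
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))   # strictly smaller
print(mse_mle, mse_js)
```

With $\mu = 0$ the James–Stein risk is close to $2$ instead of $d = 10$; trying other values of `mu` shows the improvement persisting, though shrinking, everywhere.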

When I first saw this result, I was very perplexed by the peculiar form of the Stein estimator. There is a tendency in math to present properties like this one as magical: “here is this one object that just happens to have a crazy property”, instead of detailing where the property comes from. The Stein estimator is usually presented in this manner, but it does have an interesting origin. Indeed, it is actually close to a Bayesian estimator (more precisely, it can be derived from an empirical Bayes approach; since it isn’t truly Bayesian, this suggests that there exists another estimator that dominates it). This makes a little bit of sense, since Bayesian estimators are known to have good Mean-Squared Error. Of course, I’m still very curious whether there could be other biases which result in estimators that dominate $X$ everywhere. I’m just never satisfied with a single counter-example: I want to know the set of all counter-examples.

Thus, the Stein phenomenon gives us the high-level lesson that “bias is good (in high dimensions)”. However, there remains one interesting question: in which direction should we bias our estimate?

Indeed, if we have no idea what $\mu$ is, one idea is to choose the shrinkage target $\mu_0$ randomly around $X$, which constitutes a natural first guess for $\mu$. However, that is stupid, by the following argument:

- Biasing our estimator towards $\mu_0$ is only good, compared to $X$, if $\mu_0$ pulls us towards the true value $\mu$.
- If we choose $\mu_0$ randomly around $X$, half of the values we choose are going to be further away from $\mu$ than $X$ is.
- Thus, randomly choosing our bias direction is a bad idea: it’s going to be worse than using $X$ (see https://arxiv.org/pdf/1203.5626.pdf for a more detailed presentation of that idea).

Thus, biasing is indeed good **if we have true prior information about the direction in which we should bias our guess**. If we don’t, then it is better to use the unbiased estimator, since a randomly chosen bias is unlikely to actually achieve a reduction in error.

---

My latest read was the classic “All of Statistics” by Wasserman, which is a great statistics book. Its objective is to give a compact introduction to all (or most) of statistics in a 250-page book, and it does so spectacularly well. If you want to understand stats and have a good understanding of probability and analysis, I honestly can’t recommend “All of Statistics” enough. It is a particularly good book for computer science / machine learning students, since they might have picked up some stats concepts “on the job” but never had a formal presentation of the full framework, and they likely have all of the background knowledge.

Of course, in order to make the book compact, some things had to be cut: proofs and examples. The book offers proofs of only the most important theorems, the rest being left as exercises for the reader. There are a few examples, but probably far fewer than in most books. However, and this might be a very personal opinion, I didn’t mind at all: proofs are important, but they also distract from the flow of learning about a new field such as statistics. One can always come back to a more detailed reference book if one really needs to know all the details.

In a nutshell: “All of Statistics” is a great book which I warmly recommend. The only drawback is that Prof. Wasserman doesn’t give the most honest presentation of Bayesian inference (which, as everybody knows, is the best statistical paradigm!).

---

First, let me tell you that this book is really good. Gelman and Nolan are clearly very good teachers, and they do a really good job of sharing both the very small and the very big pieces of knowledge that they have acquired over the years. The book mostly focuses on activities and examples that can be used to engage students. The objective is to make statistics more accessible, so that students retain more and get a more intuitive understanding of what statistics is about.

I’m guessing that, to the more experienced teacher, this book might feel light, as they would already have had some (most?) of the ideas presented in it. However, even if you have already had these intuitions, having them laid out clearly by Gelman and Nolan is still beneficial in my opinion.

The one issue I have with the book is that, because the authors provide such high-quality courses to their students, it seems a little hard for me to adapt their advice to my class, in which I do not have the resources for student projects, programming assignments, etc. I think it makes me a little bit jealous: I’d love to be able to do all of that! I’m going to need to think hard about how to still convey to my students some of the “softer” aspects of statistics: how to collect data in the “right” way, how to correctly summarize information in a graph, etc.

---

Today, I’ll discuss “Statistique pour mathématiciens, un premier cours rigoureux” (“Statistics for mathematicians, a rigorous first course”), the textbook of a second-year Bachelor course for mathematicians taught at EPFL. I’m reading it because I’m teaching the same course, but for engineers, next fall, and I’m hoping to find some good ideas about how to explain statistics.

As I write this post, I have read about half of the book (which is fairly short: 240 pages, including a 100-page appendix with proofs and exercise solutions). However, it is extremely dense and full of content. It reminds me of “All of Statistics” by Wasserman, in that this book could be a very good first dive into statistics for someone who feels comfortable with probability theory but has no knowledge of statistics. Of course, given that it is much shorter than “All of Statistics”, it doesn’t have the same depth or breadth, but it has all the essential points of statistics.

One section which I particularly liked was the introduction. It seems to me that all of the major difficulties of statistics are conceptual: once you understand what you are trying to accomplish, and the way you should frame your questions, the math part of statistics is (should be?) straightforward compared to other classes. However, these conceptual difficulties are either glossed over or very poorly presented. Panaretos does a great job of explaining exactly what statistics is and how it proceeds.

One particularly great example of this is when he describes how one should choose a model to describe a dataset: through a handful of clear and (at least for engineers) familiar examples, he shows how scientific, philosophical, and exploratory approaches can be used and combined to formulate an appropriate model.

I look forward to finishing reading this very nice book!

---

Let’s start by stating the theorem. Let $X_1, \ldots, X_{2n+1}$ be IID random variables, and let $M$ be their empirical median (note that since we have an odd number of datapoints, the empirical median is straightforward to find: just order the $X_i$ and take the value at index $n+1$ in the ordered list). The theorem states that, in the limit of a large dataset ($n \to \infty$), the empirical median has a Gaussian distribution. Even better: at any finite $n$, we know exactly what kind of random variable it is: it corresponds to a Beta distribution with parameters $(n+1, n+1)$ deformed by the inverse of the CDF $F$ of the $X_i$:

$$M \sim F^{-1}(B), \qquad B \sim \mathrm{Beta}(n+1, n+1).$$

In order to prove this result, let’s first focus on the case in which the $X_i$ are uniform random variables on $[0, 1]$ (so that $F$ is the identity). Let’s compute the density $p(M = x)$. We have $2n+1$ possibilities for the index of the median. Thus, we can focus on the case in which $X_{2n+1}$ is the median:

$$p(M = x) = (2n+1)\, p\big(X_{2n+1} = x \text{ and } X_{2n+1} \text{ is the median}\big).$$

We then need to worry about the other indexes: $n$ points need to be above $x$ and $n$ need to be below. This gives $\binom{2n}{n}$ possible repartitions. We can once more focus on a single possibility: $X_1, \ldots, X_n$ are below and $X_{n+1}, \ldots, X_{2n}$ are above.

Finally, all the $X_i$ are IID uniform random variables. The density above is thus straightforward to compute:

$$p(M = x) = (2n+1) \binom{2n}{n}\, x^n (1-x)^n$$

which we recognize as a Beta distribution with parameters $(n+1, n+1)$.

Ok. Now we know what happens for uniform distributions. How can we extend that result to the general case? In order to do so, we have to remember that any random variable can be constructed from a uniform random variable using the inverse of its CDF. Thus, we can construct the $X_i$ as:

$$X_i = F^{-1}(U_i), \qquad U_i \sim \mathrm{Uniform}(0, 1).$$

Furthermore, the function $F^{-1}$ is monotonic. Thus, the median of the $X_i$ is the image under $F^{-1}$ of the median of the $U_i$. Since the median of the $U_i$ has a Beta distribution, this means that the median of the $X_i$ has a deformed Beta distribution:

$$M = F^{-1}(B), \qquad B \sim \mathrm{Beta}(n+1, n+1).$$

You might notice that we haven’t talked about Gaussians yet, but we’re almost there. Beta distributions become Gaussian, with variance tending to 0, in the limit where both parameters go to infinity while their ratio stays constant. This is exactly what happens here when $n \to \infty$: we have thus proved that in the uniform case, the median becomes Gaussian. Furthermore, because the variance of the Beta goes to 0 as $n$ grows, the non-linear function $F^{-1}$ becomes locally linear around $1/2$, and the median of the $X_i$ becomes Gaussian in the general case too.
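A quick simulation (my own sketch) checks the uniform case against the $\mathrm{Beta}(n+1, n+1)$ prediction, which has mean $1/2$ and variance $1/(4(2n+3))$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10                    # 2n + 1 = 21 datapoints per replication
trials = 50000

samples = rng.uniform(0.0, 1.0, size=(trials, 2 * n + 1))
medians = np.median(samples, axis=1)

# Beta(n+1, n+1) predicts mean 1/2 and variance 1 / (4 * (2n + 3))
print(medians.mean(), medians.var())
```

Already at $n = 10$, a histogram of `medians` is visually very close to its Gaussian limit.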

Note that this property holds in general for all empirical quantiles of IID datapoints. Would you be able to prove it? You just have to slightly modify the steps which I have presented here.

---

In my preceding post, I tried to highlight what went right and wrong. Now, I’ll try to understand what I can do to make my next classes better by analyzing why things went the way they did.

One big issue during my first class was that I ended the semester extremely tired, and this happened once more during this spring semester. Part of it is understandable: these are my first courses, I have a lot to learn, and preparing each class takes me a lot of time. Part of the solution is thus going to come from my summer resolution: finishing August with all my material ready for the classes of the autumn semester, and having the material for the spring classes ready by the end of January. Hopefully, I can then spend only half a day each week preparing for each class, which should leave me less stressed and less worn out when the end of the semester comes.

The second big issue was that I was blindsided by the fact that the students didn’t like the class. That’s pretty easy to fix, however. This semester, I twice asked the student representatives to gather feedback from their peers so that I could get a better picture of how they felt. I’ll try to up this to three times for my next classes: once per month. I think this is very important, since I feel that students are shy about expressing the problems they have with a class. They are willing to give feedback: I just need to ask them for it. Hopefully, this will be unbiased feedback; my biggest worry is that they won’t be honest with me. I’ll try to be careful here.

On top of this, I feel like I have learned a little about good teaching practices. I’ll take the time to reflect on that in another blog post. Armed with this new knowledge, I hope that my classes next year can be better!

---

This first course consisted of teaching second year bachelor students, in non-mathematical studies, an introduction to probability and statistics. The content of the class was essentially probability up to the central limit theorem and an introduction to the concept of statistics, with mostly Gaussian models. The hardest statistical topic was the Student t-test for linear regression.

Overall, I think I did a barely passing job with this course. It is of course understandable that not everything can go right in a first course, but that doesn’t mean that I shouldn’t be honest with myself, my students, and the rest of the department. Let’s try to list what went wrong and what went OK.

Things that went well:

- I was a very motivated teacher, and I brought much more energy to my class than most teachers
- I was very available to my students
- I remained (somewhat; you’ll see in a second) attentive to how the course was progressing, and I think that I improved my teaching quite a lot over the semester
- I spoke well
- I didn’t write as poorly as I expected of myself. I still need to work on this point quite a lot

Now, what didn’t go well:

- I was too ambitious for a first course. I redid everything from scratch, when I shouldn’t have. This caused me to commit many mistakes. Namely:
- I didn’t take enough into account what students need. They need a lot of structure: every piece of knowledge should be clearly labeled according to how important it is to understand, etc.
- I crafted a course that I would have liked as a student. This is a bad idea, since I was always a very mathematically-minded student, and a very good student. My course should instead be aimed at a more practical level, and at a slower pace.
- I completely neglected exercises. They are an integral part of the learning experience and are **at least as important** as the lectures. This caused students to resent the exercise sessions, and minimized how much they learnt from the class.
- Even though I identified some of these flaws along the way, most of them completely blindsided me, and only came to the surface during the anonymous review of the course by the students. This means I failed to gather meaningful feedback from the class. These issues should have been identified much earlier in the semester, which would have enabled me to correct them much earlier.
- I handled the pressure of teaching pretty poorly, especially at the end of the semester. I’m a very anxious guy, so it’s not surprising that I was stressed, but this went beyond stress. Teaching a semester is a bit like running a marathon: you can’t give everything you have during the first half and finish the race crawling; you have to pace yourself. I need to manage my energy better in the future.

In the next few days, I’ll try to identify why things went wrong (and right), and what I will do to make my future classes better.
