Understanding AIC and BIC

I’ve spent the better part of the day trying to understand AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion; note that this is a misleading name: BIC has nothing to do with information theory). I’m quite happy to have understood both (partially). Let me try to re-explain it to you.


First of all, let’s understand what we are talking about. We have some random variable X which we are trying to model with various models M_k. These models are parametric probabilistic models: they specify a function p(X|\theta_k, M_k) (note that, critically, the dimension of \theta_k varies between models). AIC and BIC both deal with ideas of how to choose an appropriate model M_k. For a simple example, consider modeling X as a Gaussian (k=1), or a mixture of two Gaussians (k=2), or of three, etc.

If we had access to the exact probability distribution of X, one thing we could do to compare these various models M_k is the following:

  • First, find the value of the parameter \theta_k such that the probabilities of p(X|\theta_k,M_k) are the closest to the truth.
  • Second, report the distance between the truth and the best approximation in model M_k.
  • Third, use this to rank the various models.

One sensible notion of distance we could use is the KL divergence:

\displaystyle{ KL(p(X), q(X) = \int p(x) \log{(\frac{p(x)}{q(x)})} dx}

which has the benefit of having nice computational properties.

In practice, we won’t be able to compute this KL divergence because we do not have access to the true probability distribution of X. What we have most often access to is IID samples from X. What we can then do is use these samples to construct an unbiased estimate of the KL divergence between the truth and the best probability distribution inside model M_k. This is precisely what the AIC offers.

More precisely, we do not estimate the KL divergence. We estimate the expected log-likelihood of the best probability distribution inside model M_k. This quantity is equal to the KL up to one unknown common constant which is the entropy of the true distribution. Thus, a ranking of models based on the expected log-likelihoods has the same order as one based on KL divergences. However, we can’t use the log-likelihood at the MLE \theta_k^* because that value is biased. What Akaike did was he computed the asymptotic bias and the AIC gives a corrected value which removes this bias. This correction depends on the dimensionality of the parameter \theta_k and on the number of datapoints in the dataset. Note that other more advanced criteria also exist. The only one worth mentioning is the AICc which is a slightly improved corrections for small datasets.

Thus AIC is about correctly estimating the quantity we should care about for choosing an appropriate model. AIC doesn’t deal directly with choosing an appropriate model. However, we could use the unbiased estimates to then select a model which gives a good account of the data. AIC can also be applied to estimates of a minimum inside a parametric class which are unrelated to likelihoods.


BIC deals with a slightly different problem. Assume that we have a very large (or maybe infinite) amount of nested models: model M_{k+1} is a more complex version of model M_k. For example, consider performing linear regression while expanding the set of predictor variables, or the example I already gave of a mixture of k Gaussians. In general, the true probability distribution won’t fall inside our class of models. Thus, model M_{k+1} will always be a better model than model M_k because its increased flexibility will allow it to capture the complexity of the data better. In such situations, the BIC is inappropriate to use.

However, in some extremely rare examples, it might happen that the true probability distribution is actually inside model M_{k^*} (and thus also inside all further models). We could then try to recover this k^*. For example, for linear regression, the true model might be a quadratic polynomial. Thus, trying to fit a third degree polynomial just provides extra degrees of freedom which are not needed. We might then try to learn from the data that a second degree polynomial is sufficient as that would be informative.

BIC focuses on this task of consistently estimating k^*. BIC also represents a correction to the log-likelihood that depends on the number of datapoints and the dimensionality of the model. However, BIC doesn’t aim to correct the bias that is present in that quantity. Its aim is that we can recover k^* by finding the model with minimum BIC. This gives us a consistent estimator of k^*: when the number of datapoints n is large enough, we recover the correct value with probability 1.

However, this makes BIC extremely restricted: we can only use it if we assume that somehow we have captured the truth inside one of our models M_k. This is a very bold assumption, and it is 100% wrong, unless you have generated the data yourself. Thankfully, BIC can also be applied to a slightly more realistic case. This is the case in which the model chain is such that, after a certain k^*, the models stop improving: model M_{k^*+1} is exactly as good as model M_{k^*}. This can only happen if, for some reason, the extra flexibility is not needed, even though model M_{k^*} is not the true model. That could happen. For example, let us return to the regression model. Imagine if the true model is indeed quadratic, but the noise model you are using is incorrect. Then, all models beyond quadratic won’t give an improvement. Let us refer to this case as k^* being the index of the quasi-true model. Thankfully for its use, BIC also correctly recovers a quasi-true model (in that it gives us a consistent estimator for it).

It won’t go into details, but like the name indicates, BIC is an approximation to a Bayesian idea. More precisely, BIC is a very rough approximation of the log-posterior distribution over models when the prior is uniform over all models AND when the prior inside each model over the parameters is also flat. Honestly, as an alternative to BIC I would thus use a more Bayesian method with:

  • a realistic prior on the models which ranks more complicated models as less likely.
  • A realistic prior on the parameter space
  • Better approximations than the horrible ones that are used in BIC

This would have the exact same guarantees as BIC (the asymptotic behavior would be identical) while being more principled (at least, appearing more principled to me) before the asymptote.


Thus, it turns out that AIC and BIC are actually slightly different beasts. AIC is all about estimating a “fitting score” for each model in an unbiased fashion. It is thus extremely general. We can then use the unbiased score to decide between the various models at hand, or construct confidence intervals, etc. In contrast, BIC can only be used if, for some reason, we suspect that we are in a situation in which one model M_{k^*} is true or quasi-true. Then, we can use BIC to recover k^*. This makes BIC way less useful.



Noise in the publishing system

Today, I’d like to rant on what I perceive to be flaws of the current system of publishing (or, more accurately, on the flaws of how people treat our current system). Of course, like all rants from young people, please take it with a large pinch of salt: I know I know nothing, but I’d just like to be able to vent.


What annoys me is the fact that so many publications are either low-quality (the same work could be presented in a much clearer fashion) or low-effort (the work represents a marginal improvement over the existing state of the art). A small note: I’m more than fine with incremental work: it is an essential stepping stone in science. Most everything we do is definitely not a breakthrough. However, what is extremely annoying is when the authors aren’t straightforward about how their work is incremental. Some results are presented as if the authors are offering a revolutionary approach, even though it’s just the same old crap that they are re-hashing for the third time.

These two flaws make it so that reading articles is extremely unenjoyable and much harder that it has to be: when I’m reading, I want to absorb new knowledge. I really don’t want to fight against the authors to decode whatever they meant, and I really really don’t want to have to remain hyper-attentive to decipher which parts are new and what is old stuff that I already know (and that the authors are probably butchering in their attempts at obfuscation).

I don’t know where these flaws come from and how to fix them. I’m guessing that part of the problem is that researchers fell under so much pressure to produce new articles in order to secure funding/positions/etc. As such, they need to cut corners and this explains the rushed articles and why they are trying to make their contribution sound more impressive than it is (which makes it so that their article gets accepted).

What I can do is I can strive to ensure that my contributions don’t have these flaws (oh the arrogance of youth). I’ll try as much as possible to have my contributions be as clear as I can make them (and I’ll take the time to ensure that this happens: I won’t rush to get something out if it isn’t ready). And, when I do some incremental work, I’ll make sure that I properly document exactly how it is positioned compared to the litterature AND I’ll use such occasions to try to clarify the existing literature. I’ll do so by treating the corresponding article as a tutorial, with the objective that readers that aren’t familiar with the field wouldn’t need to refer to other works to understand the state of the art.

Hopefully, I can follow through on this ideal.



Practical statistics basics: center and scale your predictors

Today, I want to discuss something that seems extremely small but is critical in “supervised” problems in which you are trying to predict some data Y from some other data X. In a nutshell, you should always make sure that your predictors are centered (their center is 0) and scaled (their width is 1). Let’s dive into the details!


First, let me present the general “supervised learning” setting. We are given some number of pairs of examples (X,Y). What we want to accomplish is to learn a function such that Y \approx f(X). In general, we focus on trying to learn a linear function f(X) = \sum \theta_i X_i but more general forms for the function are also possible. The usual approach is to write down a probabilistic model of the Y conditional on the X and \theta and to maximize the likelihood to find the best values for \theta. If we want to be fancy, we can also add a regularizing penalty such the L_2 one: \sum \theta_i^2.

This is all straightforward statistics. However, there is one key step that many sources often forget to mention and it is critical. Quite simply, we need to do a little bit of processing on all the components of our predictor X. This processing needs to ensure that the various components X_i are all (approximately) centered and scaled.

This can be justified in a variety of ways, but the one that makes the most sense to me is that we should try to make our methods as invariant to details of the input as they can be , unless we have a very good reason for that. In this case, it is trivial to imagine situations in which the predictors are shifted around for some reason. For example, if we choose different units for a measurement, that would change the scale of the predictors. It’s rarer for the center of the distribution to change, but that can sometimes happen. All of these modifications that could happen shouldn’t change the result of our inference. Thus, our methods should have a step to remove these extra degrees of freedom and ensure that our inference is invariant.

Furthermore, consider that we are trying to do is gain information from the X_i. When a value x_i is close to the center of the X_i values, that is a normal value of X_i that should provide us with no information. Thus, it shouldn’t change our evaluation of Y. This intuition can only occur if the center of X_i is 0. Similarly, in order to know how relevant it is that x_i differs from its center, we need to know the scale at which X_i varies. If the value we are considering is close to 0 at the relevant scale for X_i, then again this should have a low-impact on the value we predict for Y. Centering and scaling the predictors thus ensures that we treat the information we gain from all of them equally.


Now comes the thorny question: how exactly should we center and scale the predictors? Indeed, there are infinitely many notions of the center and scale of a random variable: should we center with the (empirical) mean of the x_i ? or should we prefer the median? Should we scale using the square-root of the variance? Or the L_1 deviation: min_{m} E( |X_i - m|) ? Or the inter-quartile spacing? I do not know the appropriate answer to these questions (and honestly, I’m not even sure there is a single appropriate answer). My instinct is to use a robust (key instinct: always be robust) measure of the width so the L_1 deviation sounds like a fine choice to me, but the variance is probably fine too.


As a final note, let me talk about one very cool method that people have been using for deep-learning that makes gradient descent work better. This method is called batch normalization. During batch training, we do not compute the output of the deep network as usual. What we do is compute the activity of each layer one by one, and before feeding its output to the next layer, we make sure that the output of each unit in the layer is centered and scaled (using the empirical mean and variance inside the training batch under consideration). This little trick really improves the speed at which the network learns (I’m not sure if anybody has a good intuition as to why, but I don’t).


As a conclusion, please always remember to center and scale your predictors when doing supervised learning!



Some remarks on shrinking and the Stein phenomenon

I’ve been reading on the bias-variance trade-off and, most importantly, on the Stein phenomenon. Here are some of my thoughts on the subject which I hope can help others with this slightly thorny subject.


First, let us set the stage. We want to infer the value of some d-dimensional parameter: \theta. In order to do so, we are given a single observation X which corresponds to \theta corrupted by Gaussian noise with covariance the identity matrix:

\displaystyle{ X = \theta + \eta }

How should we estimate \theta from X ?


A natural idea consists in using X itself. This is the maximum-likelihood estimator of \theta and is indeed a good estimator of \theta. It is the best estimator that is translation-invariant; it is minimax; etc.

However, Stein has shown that X is not actually perfect: there exists a whole family of estimators which are better than it, if d \geq 3. These are estimators of the form:

\displaystyle{ \hat{\theta} = \left(1 - \frac{d-2}{\|X -\theta_0 \|} \right) (X-\theta_0) +\theta_0}

No matter the true value of \theta, these estimators always have lower Mean-Squared Error than X. In other words, they always do a better job! In sense, it is thus slightly “stupid” to use X instead of them, because you are going to make bigger errors by doing so.


The reason this occurs is due to the strong power of biasing your estimators in high-dimensions. Biasing causes an increase in error, but causes a stronger reduction in the variance, and is thus beneficial. The Stein-estimator creates a bias towards \theta_0 which is able to improve the performance of the estimation. While, other biased estimators such as adding a L_1 or L_2 penalty (in machine-learning’s horrible jargon, these are known as Lasso and Ridge penalties) would also improve over X when close enough to \theta_0, the more general estimator X is better than them when we are far away. However, the Stein-estimator is always better since its bias vanishes when X is far from \theta_0.


When I first saw this result, I was very perplexed by the very peculiar form of the Stein-estimator. There is this tendency in math to present properties like this one as magical: “here is this one guy that just happens to have a crazy property”, instead of detailling where the property comes from. The Stein-estimator is usually presented in this manner, but it does have an interesting origin. Indeed, it is a actually close to a Bayesian estimator (more precisely, he can be derived from an empirical Bayes approach. Since it isn’t truly Bayesian, it should mean that there exists another estimator that dominates it). This makes a little bit of sense, since Bayesian estimators are known to have good Mean Squared Error. Of course, I’m still very curious whether their could be other biases which result in estimators which dominate X everywhere. I’m just never satisfied with a single counter-example: I want to know the set of all counter-examples.


Thus, the Stein phenomenon gives us the high-level lesson that “Bias is good (in high-dimensions)”. However, there remains one question: in which direction should we bias our estimate? This is an interesting question.

Indeed, if we have no idea which \theta_0, one idea is to choose randomly around X which constitues a natural first guess for \theta. However, that is stupid, by the following argument:

  • Biasing our estimator towards \theta_0 is only good compared to X if \theta_0 is pulling us towards the true value \theta.
  • If we choose randomly around X, half of the values that we choose are going to be further away from \theta than X
  • Thus, randomly choosing our bias direction is a bad idea: it’s going to be worse than X (see: https://arxiv.org/pdf/1203.5626.pdf for a more detailled presentation of that idea).

Thus, biasing is indeed good if we have true prior information about the direction in which we should bias our guess. If we don’t, then it is better to use the unbiased estimator X since a randomly chosen bias is unlikely to actually achieve a reduction in error.

Reading: “All of statistics” (Wasserman) a second time

I’ve spent almost this whole week reading (and now my head hurts. However, since correlation is not causation, I don’t feel like concluding a causal relationship between the two).

My latest read was the classic “All of statistics” by Wasserman which is a great statistics book. Its objective is to give a compact introduction to all / most of statistics in a 250 pages book, and it does so spectacularly well. If you want to understand stats, and have a good understanding of proability and analysis, I honestly can’t recommend enough “all of statistics”. This is a particularly good book for computer science / machine learning students, since they might have picked up some stats concept “on the job” but never had a formal presentation of the full framework, and they likely have all of the background knowledge.

Of course, in order to make the book compact, some things had to be cut: proofs and examples. The book offers proofs of only the most important theorems, the rest being left as an exercise to the reader. There are a few examples, but probably way less than in most books. However, and this might be a very personal opinion, I didn’t mind at all: proofs are important, but they also distract from the flow of learning about a new field such as stat: one can always come back to a more detailled reference book if one really needs to know all the details.

In a nutshell: “All of statistics” is a great book which I warmly recommend. The only drawback is that Prof. Wasserman doesn’t give the most honest presentation of bayesian inference (which, everybody knows, is the best statistical paradigm!).

Reading: “Teaching statistics” (Gelman & Nolan)

I got some more reading done in the last few days. This time I was reading “Teaching statistics: a bag a tricks” by A. Gelman and D. Nolan. The book aims at giving lots of examples and ideas to make an introductory statistics lecture better (and it also has chapters at the end that give ideas for some more advanced classes. Critically, they have one on teaching Bayesian inference, which I’ll come back to when I’m preparing that class).


First, let me tell you that this book is really good. Gelman and Nolan are clearly very good teachers and they make a really good job of sharing both the very little and the very big pieces of knowledge that they have acquired over the years. The book mostly focuses on activities / examples that can be used to engage with students. The objective is to make statistics more accessible so that the students retain more and get a more intuitive understanding of what statistics is about.

I’m guessing that, to the more experienced teacher, this book might feel light, as they would have already had some (most?) of the ideas that are presented. However, even if you have already had the intuitions, having them laid out clearly by Gelman and Nolan would still be beneficial in my opinion.


The on issue I have with the book is that, because the authors provide such high-quality courses to their students, it seems a little hard for me to adapt their advice to my class in which I do not have the resources for student projects, programming assignments, etc. I think it makes me a little bit jealous: I’d love to be able to do all of that ! I’m going to need to think hard about how to still convey to my students some of the “softer” aspects of statistics: how to collect data in a “right” way, how to correctly summarize information in a graph, etc.

Reading: “Statistique pour mathématiciens” (V. Panaretos)

I’m going to be using the summer break to get some reading done. I’ll do one or more post on every book I read so that I can share a little bit of what I learn.

Today, I’ll discuss “Statistique pour mathématiciens, un premier cours rigoureux” which is a second year Bachelor course aimed at mathematicians which is used in EPFL. I’m reading it because I’m teaching the same course but for engineers next fall, and I’m hoping to find some good ideas about how to explain statistics.

As I write the post, I have read about half of the book (which is fairly short: 240 pages, including a 100 page-long appendix with proofs and exercise corrections). However, it is extremely dense and full of content. It reminds me of “All of statistics” by Wasserman in that this book could be a very good first dive into statistics for someone who feels comfortable with probability theory but who has no knowledge of statistics. Of course, given that it is much shorter that “All of statistics”, it doesn’t have the same depth or breadth, but this book has all the essential points of statistics.

One section which I particularly liked was the introduction. It seems to me that all of the major difficulties of statistics are conceptual: once you understand what you are trying to accomplish, and the way you should frame your questions, the math part of statistics is (should be?) straightforward compared to other classes. However, these conceptual difficulties are either glossed over or very poorly presented. Panaretos does a great job of explaining exactly what statistics is and how it proceeds.

One particularly great example of this is when he describes how one should choose a model to describe a dataset: through a handful of clear and (at least for engineers) familiar examples, he gives great examples of how scientific, philosophical, or exploratory approaches can be used and combined to formulate an appropriate model.


I look forward to finishing reading this very nice book !