Understanding AIC and BIC

I’ve spent the better part of the day trying to understand AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion; note that this is a misleading name: BIC has nothing to do with information theory). I’m quite happy to have understood both (at least partially). Let me try to re-explain them to you.

 

First of all, let’s understand what we are talking about. We have some random variable X which we are trying to model with various models M_k. These models are parametric probabilistic models: each one specifies a probability distribution p(X|\theta_k, M_k) (note that, critically, the dimension of the parameter \theta_k varies between models). AIC and BIC both deal with how to choose an appropriate model M_k. For a simple example, consider modeling X as a single Gaussian (k=1), or a mixture of two Gaussians (k=2), or of three (k=3), and so on.

If we had access to the exact probability distribution of X, one thing we could do to compare these various models M_k is the following:

  • First, find the value of the parameter \theta_k such that the distribution p(X|\theta_k, M_k) is closest to the truth.
  • Second, report the distance between the truth and the best approximation in model M_k.
  • Third, use this to rank the various models.

One sensible notion of distance we could use is the KL divergence:

\displaystyle{ KL(p(X), q(X)) = \int p(x) \log{\left(\frac{p(x)}{q(x)}\right)} dx}

which has the benefit of having nice computational properties.

In practice, we won’t be able to compute this KL divergence because we do not have access to the true probability distribution of X. What we most often have access to is IID samples from X. What we can then do is use these samples to construct an unbiased estimate of the KL divergence between the truth and the best probability distribution inside model M_k. This is precisely what the AIC offers.

More precisely, we do not estimate the KL divergence itself. We estimate the expected log-likelihood of the best probability distribution inside model M_k. This quantity is equal to the negative KL divergence up to an unknown constant (the entropy of the true distribution) which is common to all models. Thus, a ranking of models based on expected log-likelihoods is equivalent to a ranking based on KL divergences. However, we can’t simply use the in-sample log-likelihood at the MLE \theta_k^*, because that value is a biased (optimistic) estimate. What Akaike did was compute the asymptotic bias; the AIC is the corrected value with this bias removed. The correction depends on the dimensionality of the parameter \theta_k and, when expressed per datapoint, on the number of datapoints in the dataset. Note that other, more refined criteria also exist; the only one worth mentioning here is the AICc, which is a slightly improved correction for small datasets.
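
For reference, writing \hat{\theta}_k for the MLE, d_k for the dimension of \theta_k, and n for the number of datapoints, the standard expressions (in the conventional -2 scaling, so that smaller is better) are

\displaystyle{ AIC_k = -2 \log{p(\mathrm{data}|\hat{\theta}_k, M_k)} + 2 d_k}

and, for the small-sample correction,

\displaystyle{ AICc_k = AIC_k + \frac{2 d_k (d_k + 1)}{n - d_k - 1}}

Dividing AIC_k by -2n gives the (asymptotically) unbiased estimate of the expected log-likelihood per datapoint discussed above.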

Thus, AIC is about correctly estimating the quantity we should care about when choosing an appropriate model; it doesn’t perform the choice itself. However, we can use the unbiased estimates to select a model which gives a good account of the data. AIC-style corrections can also be applied to estimates obtained by minimizing other criteria inside a parametric class, unrelated to likelihoods.
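
As a concrete illustration of this use, here is a minimal sketch in Python of comparing the mixture-of-Gaussians models from the earlier example by their AIC. It relies on scikit-learn’s GaussianMixture (whose aic method implements the -2\log{L} + 2 d_k formula above); the synthetic dataset is just an illustrative stand-in.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative stand-in for IID samples of X: a mixture of two Gaussians.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 500),
                    rng.normal(3.0, 0.5, 500)]).reshape(-1, 1)

# Fit each candidate model M_k (a mixture of k Gaussians) and report its AIC.
for k in range(1, 5):
    model = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(f"k={k}: AIC={model.aic(X):.1f}")  # lower is better
```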

 

BIC deals with a slightly different problem. Assume that we have a very large (or maybe infinite) number of nested models: model M_{k+1} is a more complex version of model M_k. For example, consider performing linear regression while expanding the set of predictor variables, or the example I already gave of a mixture of k Gaussians. In general, the true probability distribution won’t fall inside our class of models. Model M_{k+1} will then always be a better model than model M_k, because its increased flexibility allows it to capture the complexity of the data better. In such situations, BIC is inappropriate to use.

However, in some extremely rare cases, it might happen that the true probability distribution is actually inside model M_{k^*} (and thus also inside all further models). We could then try to recover this k^*. For example, in the regression setting, the true mean function might be a quadratic polynomial. Fitting a third-degree polynomial then just provides extra degrees of freedom which are not needed, and we might try to learn from the data that a second-degree polynomial is sufficient, which would be informative.

BIC focuses on this task of consistently estimating k^*. Like AIC, BIC is a correction to the log-likelihood that depends on the number of datapoints and the dimensionality of the model. However, BIC doesn’t aim to remove the bias that is present in that quantity. Its aim is to let us recover k^* by finding the model with minimum BIC. This gives us a consistent estimator of k^*: as the number of datapoints n grows, the probability of recovering the correct value tends to 1.
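
For reference, with the same notation as above, the standard expression is

\displaystyle{ BIC_k = -2 \log{p(\mathrm{data}|\hat{\theta}_k, M_k)} + d_k \log{n}}

The penalty grows like \log{n} rather than staying constant as in AIC, and this growing penalty is what makes minimizing BIC a consistent way of recovering k^*.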

However, this makes BIC extremely restricted: we can only use it if we assume that somehow we have captured the truth inside one of our models M_k. This is a very bold assumption, and it is 100% wrong unless you have generated the data yourself.

Thankfully, BIC can also be applied to a slightly more realistic case: the case in which the model chain is such that, after a certain k^*, the models stop improving, i.e. model M_{k^*+1} is exactly as good as model M_{k^*}. This can only happen if, for some reason, the extra flexibility is not needed, even though model M_{k^*} is not the true model. That can happen: for example, let us return to the regression example. Imagine that the true mean function is indeed quadratic, but the noise model you are using is incorrect. Then all models beyond the quadratic one won’t give an improvement. Let us refer to this case as k^* being the index of the quasi-true model. Thankfully for its usefulness, BIC also correctly recovers a quasi-true model (in that it gives us a consistent estimator of k^*).
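
To make the regression example concrete, here is a minimal sketch of recovering k^* by minimizing BIC over the polynomial degree. The synthetic data and the Gaussian noise model are my own illustrative assumptions; the BIC formula used is the one given above.

```python
import numpy as np

# Illustrative synthetic data: the true mean function is quadratic (k* = 2).
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-3.0, 3.0, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 1.0, n)

def bic_for_degree(deg):
    coeffs = np.polyfit(x, y, deg)            # MLE of the polynomial coefficients
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)              # MLE of the Gaussian noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    d = deg + 2                               # (deg + 1) coefficients + 1 variance
    return -2.0 * log_lik + d * np.log(n)

bics = {deg: bic_for_degree(deg) for deg in range(1, 7)}
print(min(bics, key=bics.get))  # should typically print 2, the true degree
```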

I won’t go into details but, as the name indicates, BIC is an approximation to a Bayesian idea. More precisely, BIC is a very rough approximation of the log-posterior distribution over models when the prior is uniform over all models AND when the prior over the parameters inside each model is also flat. Honestly, as an alternative to BIC I would thus use a more Bayesian method with:

  • a realistic prior on the models, which ranks more complicated models as less likely;
  • a realistic prior on the parameter space;
  • better approximations than the horrible ones that are used in BIC.

This would have the exact same guarantees as BIC (the asymptotic behavior would be identical) while being more principled (or at least appearing more principled to me) before the asymptotic regime is reached.
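
For completeness, the approximation in question keeps only the leading terms of the Laplace approximation to the log marginal likelihood of each model,

\displaystyle{ \log{p(\mathrm{data}|M_k)} = \log{\int p(\mathrm{data}|\theta_k, M_k)\, p(\theta_k|M_k)\, d\theta_k} \approx \log{p(\mathrm{data}|\hat{\theta}_k, M_k)} - \frac{d_k}{2} \log{n}}

dropping all prior-dependent and O(1) terms, so that minimizing BIC_k approximately maximizes the posterior probability of M_k under the flat priors mentioned above.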

 

Thus, it turns out that AIC and BIC are actually slightly different beasts. AIC is all about estimating a “fitting score” for each model in an unbiased fashion. It is thus extremely general. We can then use the unbiased score to decide between the various models at hand, or construct confidence intervals, etc. In contrast, BIC can only be used if, for some reason, we suspect that we are in a situation in which one model M_{k^*} is true or quasi-true. Then, we can use BIC to recover k^*. This makes BIC way less useful.

 
