The difference between Bayesian and frequentist

Today, I’ll talk about the “big ideological divide” in statistics: the Bayesian method versus the frequentist method. I won’t tell you that one is better; I’ll just try to give a short overview of what the two terms mean.

Many people have done this over the years, but I really don’t like the way they present it. Usually, the difference between the two is presented as a difference of philosophical interpretation of probabilities which:

  • makes it sound like a sort of religious divide
  • makes it sound like the two approaches are impossible to reconcile

That second point is very important, because there is a lot to be gained from knowing both approaches. I won’t go into too much detail, but: some frequentist estimators can be reinterpreted from a Bayesian point of view; in decision theory, a good way to find a minimax estimator is to find a Bayes-optimal estimator with some particular properties; you can (and should!) do frequentist analysis of Bayesian methods; etc.

It seems to me that the difference between the two is actually that you ask different questions depending on whether you have your Bayesian hat or your frequentist hat on. When you have a dataset and you try to come up with a method to structure it and/or to predict future datapoints, one way to do this is to follow a Bayesian approach (choose a prior, a likelihood, etc.). Once you have a method in mind, you should check whether the method works well on various datasets, which corresponds to asking questions about the frequentist properties of the method. Let’s illustrate this with an example.

Is the coin fair or not?

You are playing heads or tails with a friend, and you want to figure out whether the coin you are playing with is fair: i.e., you want to estimate the probability of getting heads on the next flip. This example is basically the “Hello World” of statistics.

Designing an analysis with the Bayesian method

Let’s start with designing a method to analyze the data, using Bayes formula. Consider the following generating model:

We first pick \theta at random; it will be the probability that our coin gives a “Heads” result. We then generate n heads-or-tails results, according to n independent Bernoulli variables of parameter \theta, or equivalently, we pick the number of heads at random from a binomial distribution with parameters (n,\theta). We’ll write m for the number of heads.

For the distribution of \theta, we will take the uniform distribution over [0,1]. The joint probability of \theta and m is thus, as a function of the parameter n:

\displaystyle p(\theta,m;n) = \binom{n}{m} \theta^{m} (1-\theta)^{n-m}
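The generative model above can be sketched in a few lines. This is a minimal sketch assuming NumPy; the function name is mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n, rng):
    """One draw from the generative model:
    theta ~ Uniform(0, 1), then m ~ Binomial(n, theta)."""
    theta = rng.uniform(0.0, 1.0)   # prior draw for the heads probability
    m = rng.binomial(n, theta)      # number of heads in n flips
    return theta, m

theta, m = sample_dataset(30, rng)
print(theta, m)
```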

Let’s now come back to our coin. We flip the coin n=30 times and observe m=13 heads inside that sequence. The posterior distribution of \theta for that dataset is found to be:

\displaystyle p(\theta|m;n) \propto \theta^{m} (1-\theta)^{n-m}

which we recognize as a Beta(m+1, n-m+1) distribution, a well-behaved distribution for which we know the mean, the variance, the mode, and all sorts of other properties.
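We can check this numerically: evaluate the unnormalized posterior on a grid and compare its mode and mean to the closed-form Beta values. A minimal sketch assuming NumPy:

```python
import numpy as np

n, m = 30, 13  # 30 flips, 13 heads observed

# Evaluate the unnormalized posterior theta^m (1 - theta)^(n - m) on a grid
theta = np.linspace(0.0, 1.0, 100001)
unnorm = theta**m * (1.0 - theta)**(n - m)
post = unnorm / unnorm.sum()          # normalize (discrete approximation)

# The grid mode and mean should match the Beta(m + 1, n - m + 1) closed forms
mode = theta[np.argmax(post)]         # ~ m / n
mean = (theta * post).sum()           # ~ (m + 1) / (n + 2)
print(mode, mean)
```

These two grid summaries are exactly the two estimators derived next.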

Given this data, what should we believe the value of \theta, the probability that the coin gives heads, to be? We can actually give several answers to this question, depending on exactly what we wish to accomplish. If we want to maximize our probability of being right, we should choose the mode of p(\theta|m;n). This is called the maximum a posteriori (MAP) estimator. In this model, it has the following value:

\displaystyle \theta_{MAP} = \frac{m}{n}

Because our prior distribution is uniform, the MAP estimator is also the maximum of the likelihood function \theta \rightarrow p(m|\theta;n), so it is also the maximum-likelihood (ML) estimator. If the prior were different, the MAP and ML estimators would differ. But maybe we would instead want to minimize the squared error of our estimator, ie: find a value that is more representative of the global shape of the posterior. This minimum mean-squared error (MMSE) estimator is always equal to the expected value of \theta under the posterior distribution. In our model, it has the following value:

\displaystyle \theta_{MMSE} = \int \theta p(\theta|m;n) d\theta = \frac{m+1}{n+2}

The MMSE estimator, in this specific model, also corresponds to the predicted probability of getting heads on the (n+1)-th observation (which I leave as an exercise for the curious reader).
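If you don’t feel like doing the exercise on paper, here is a quick Monte Carlo sanity check of that claim, assuming NumPy: draw \theta from the posterior, flip once, and look at the fraction of heads.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 30, 13

# Draw theta from the Beta(m + 1, n - m + 1) posterior, then flip once.
# The fraction of heads estimates p(heads on flip n + 1 | data).
thetas = rng.beta(m + 1, n - m + 1, size=200_000)
flips = rng.random(200_000) < thetas
print(flips.mean())  # should be close to (m + 1)/(n + 2) = 0.4375
```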

This concludes our section on the Bayesian method: we have designed two slightly different estimators for the value of \theta when we observe some dataset with a given number of heads and tails. Note that assuming a different prior distribution for \theta would give us slightly different expressions for the estimators.

Assessing the frequentist properties of the estimators

Let’s now put on our frequentist hat and check whether our estimators have good properties. Let’s first assume that our generative model is correct, and write \theta_0 for the true heads probability of the coin that generated the dataset we are looking at. We flip this coin n=30 times, obtain a number of heads m, and compute the estimators \theta_{MAP} and \theta_{MMSE}. m follows a binomial distribution with parameters (n,\theta_0). In particular, if we condition on a value of \theta_0, m has mean n\theta_0 and variance n\theta_0(1-\theta_0).

We can translate these results to the MAP estimate: \theta_{MAP} is always unbiased (ie: E(\theta_{MAP}|\theta_0)=\theta_0) and its variance, \theta_0(1-\theta_0)/n, is always smaller than 1/(4n). These two properties are important because they tell us that, when we compute \theta_{MAP}, we can expect the true value \theta_0 to lie in a small neighborhood around \theta_{MAP}, ie: we can construct a confidence interval.
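This frequentist claim is easy to check by simulation: fix \theta_0, generate many datasets, and look at the sampling distribution of \theta_{MAP}. A minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta0 = 30, 0.4
trials = 200_000

# Repeatedly generate datasets from a fixed theta_0 and look at the
# sampling distribution of theta_MAP = m / n.
m = rng.binomial(n, theta0, size=trials)
theta_map = m / n

print(theta_map.mean())  # ~ theta0 (the estimator is unbiased)
print(theta_map.var())   # ~ theta0 * (1 - theta0) / n, at most 1 / (4n)
```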

We can also compute the mean and variance for the MMSE estimator:

\displaystyle E(\theta_{MMSE}|\theta_0) = \frac{n\theta_0+1}{n+2} = 0.5 + \frac{n}{n+2}(\theta_0-0.5)

\displaystyle var(\theta_{MMSE}|\theta_0) = \frac{n}{(n+2)^2}\theta_0(1-\theta_0)

from which we can see that the MMSE estimator is biased towards 0.5, and that, by the bias-variance trade-off, this bias enables it to have a smaller variance than the MAP estimator.
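Plugging the formulas above into the bias-variance decomposition MSE = variance + bias² makes the trade-off concrete. For example, near \theta_0 = 0.5 the shrunk MMSE estimator wins:

```python
n, theta0 = 30, 0.5

# Closed-form sampling moments, conditioned on theta_0 (derived above)
var_map = theta0 * (1 - theta0) / n
bias_mmse = (n * theta0 + 1) / (n + 2) - theta0
var_mmse = n * theta0 * (1 - theta0) / (n + 2) ** 2

# Bias-variance decomposition: MSE = variance + bias^2
mse_map = var_map                      # MAP is unbiased, so MSE = variance
mse_mmse = var_mmse + bias_mmse ** 2

print(mse_map, mse_mmse)  # at theta_0 = 0.5 the MMSE estimator has smaller MSE
```

For \theta_0 far from 0.5 the bias term grows and the comparison can reverse; trying a few values of `theta0` is instructive.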

Another interesting analysis, which I won’t carry out here because it is a bit long, consists in studying the method when the data we apply it to isn’t generated by our probabilistic model. If you are interested, it’s a great exercise: for example, take data with temporal correlation, such as a one-step Markov chain (which seems like a reasonable model for the coin-tossing procedure). This problem is called model misspecification, and it is an important issue that needs to be addressed in Bayesian statistics.

There are essentially two big consequences of misspecification. First, the true model (eg: the Markov chain) no longer has a \theta parameter, so our estimators can’t estimate it in any way. What they do estimate is the probability of heads under the equilibrium distribution of the Markov chain. Second, the variance of the estimators increases, and can be quite a bit larger than the upper bound we derived before! This means that if we are not careful to apply our model only to truly independent data, then whatever confidence interval we derive from our theoretical analysis, with its key assumption of independence, can be completely off.
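Both consequences show up in a small simulation. Here is a sketch, assuming NumPy and a symmetric two-state Markov chain (the `stay` parameter and function name are mine): the chain’s equilibrium heads probability is 0.5, but the flips are correlated, so the variance of \theta_{MAP} blows past the i.i.d. bound.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, stay = 30, 100_000, 0.9  # stay = prob. the next flip repeats the last

def simulate_markov_map(n, stay, trials, rng):
    """theta_MAP = m / n on data from a symmetric two-state Markov chain
    whose equilibrium heads probability is 0.5 (vectorized over trials)."""
    x = rng.integers(2, size=trials)        # first flip of every trial
    heads = x.copy()
    for t in range(1, n):
        switch = rng.random(trials) >= stay  # switch outcome with prob 1 - stay
        x = np.where(switch, 1 - x, x)
        heads += x
    return heads / n

theta_map = simulate_markov_map(n, stay, trials, rng)
print(theta_map.mean())  # ~ 0.5: the equilibrium heads probability
print(theta_map.var())   # much larger than the i.i.d. value 0.25 / n
```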

How could we assess how bad misspecification can be for a Bayesian method without this sort of frequentist approach? We can’t! This is why I really believe that the two methods should be considered together, and not on their own.

Summary and conclusion

Hopefully, you now see better the distinction I want to make between a Bayesian method, which is a specific way of constructing methods/algorithms to answer probability questions, and a frequentist analysis of a given method, which checks whether, and under which assumptions on the data-generating process, the method works well. My personal opinion is that these are two complementary approaches, and that it is short-sighted and damaging to ignore one half of the wonderful field of statistics.

Finally, note that the Bayesian method is one way to construct estimators and statistical methods. Another way is to start by specifying frequentist properties that we want our estimator to satisfy, and to then find an estimator that satisfies these conditions. It’s thus possible to work outside of the Bayesian paradigm.


Why statistics are awesome

I will start this blog by talking about why statistics are pretty much the coolest domain of math. To sum up, it’s because probability is the language of reason in a stochastic world.

You might be surprised by my thesis: isn’t logic the true language of reason? Logic is indeed a language of reason, but one that only applies to deterministic worlds! Logic is perfectly suited to the realm of mathematics, in which everything is determined by absolutely rigid rules from which you can’t escape.

However, if we need to reason in a world in which relationships are not completely deterministic, we need a language with more expressive power than logic, and this is where probability and statistics come into play. The two are very related, but if we really want to make a distinction, we could say that probability deals with how we should act to accomplish a goal efficiently (eg: what is a good poker strategy), whereas statistics deals with what we should believe (eg: is this pattern I’m seeing significant or not). Not only is statistics a language well suited to describing inferences in a stochastic world; one can also show that Bayesian statistics actually contains logic as a special case.

Statistics thus has a very important role to play in the scientific method. Most of science is basically trying to find patterns and checking whether established patterns hold up in new situations. For example, if you wanted to test whether Newton’s theory of gravitation is a better account of reality than Einstein’s, you would design an experiment in which the two theories give different predictions, collect data, put on your statistician hat and check which theory the data agrees with (if it agrees with any).

But statistics actually plays a much wider role in this world: every human on earth (and most animals) is an intuitive statistician, collecting data about the regularities of their environment and trying to act on them. This statistical knowledge is a fundamental component of the behavior of every being on this planet. For example, you know intuitively what normal weather is in the city you live in, and you can probably predict the weather for the next few days because you have observed the regularities of the weather throughout your life; you know intuitively how your friends and family react to a wide variety of situations because you have lived through a wide variety of situations with them; and even for a person you’ve just met, you can guess how they would act, because they are most likely going to act like somebody you already know! Our world is filled with patterns, and it seems that recognizing them was particularly beneficial to animals’ survival. A lot of animals thus seem to have “learned” (through evolution) intuitive statistics, through the prism of which they interpret the world.

To conclude, when we study statistics, we are studying the fundamental language of reason, which is not only at the heart of science, but also inside the head of every creature and every human on this little earth. If that’s not cool, then I don’t know what is.

If you want to read more on this, I can only recommend E.T. Jaynes’s book, which is absolutely awesome, and which makes this point much better than I ever could.