Today, I’ll talk about the “Big Ideological Divide” in statistics: the Bayesian method versus the frequentist method. I won’t tell you that one is better; I’ll just try to give a short overview of what the two terms mean.
Many people have done this over the years, but I really don’t like the way they present it. Usually, the difference between the two is presented as a difference of philosophical interpretation of probabilities which:
- makes it sound like a sort of religious divide
- makes it sound like the two approaches are impossible to reconcile
That second point is very important because there is a lot to be gained by knowing both approaches. I won’t go into too much detail, but here are a few examples: some frequentist estimators can be re-interpreted from a Bayesian point of view; in decision theory, a good way to find a minimax estimator is to find a Bayes-optimal estimator with some particular properties; you can do (and should do!) frequentist analysis of Bayesian methods; etc.
It seems to me that the difference between the two is actually that you ask different questions when you have your Bayesian hat on and when you have your frequentist hat on. When you have a dataset and you try to come up with a method to structure the dataset and/or to predict future datapoints, one way you could do this is by following a Bayesian approach (choose a prior, a likelihood, etc.). Once you have a method in mind, you should check whether the method works well on various datasets, which corresponds to asking questions about the frequentist properties of the method. Let’s illustrate this with an example.
Is the coin fair or not
You are playing heads or tails with a friend, and you want to figure out whether the coin you are playing with is fair: i.e., you want to estimate the probability of getting heads on the next flip. This example is basically the “Hello World” of statistics.
Designing an analysis with the Bayesian method
Let’s start with designing a method to analyze the data, using Bayes’ formula. Consider the following generating model:
We first pick $\theta$ at random, which will be the probability of giving a “Heads” result for our coin. We then generate $n$ heads or tails results, according to $n$ independent Bernoulli variables of parameter $\theta$, or equivalently, we pick the number of heads at random from a binomial distribution of parameters $(n, \theta)$. We’ll note $k$ the number of heads.
For the distribution of $\theta$, we will take the uniform distribution over $[0, 1]$. The joint probability of $k$ and $\theta$ is thus, as a function of the parameter $n$:
$$p(k, \theta \mid n) = \binom{n}{k}\, \theta^{k} (1 - \theta)^{n - k}, \qquad \theta \in [0, 1],\ k \in \{0, \dots, n\}$$
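Let’s now come back to our coin. We flip the coin $n$ times and observe $k$ heads inside that sequence. The posterior distribution of $\theta$ for that dataset is found to be:
$$p(\theta \mid k, n) = \frac{(n + 1)!}{k!\,(n - k)!}\, \theta^{k} (1 - \theta)^{n - k}$$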
which we recognize as a Beta distribution, a well behaved distribution for which we know the mean, the variance, the mode and all other sorts of properties.
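As a quick sanity check on this posterior, here is a minimal Python sketch. It assumes the notation $n$ for the number of flips and $k$ for the number of heads, and evaluates the Beta($k+1$, $n-k+1$) density that a uniform prior leads to, then checks numerically that it integrates to 1. The function names and the data (7 heads in 10 flips) are illustrative assumptions, not part of the original analysis.

```python
import math

def posterior_density(theta, n, k):
    """Density of the Beta(k + 1, n - k + 1) posterior at theta (uniform prior)."""
    # (n + 1) * C(n, k) equals the normalizing constant (n + 1)! / (k! (n - k)!).
    return (n + 1) * math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

def integrate(f, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / steps
    return sum(f((i + 0.5) * h) for i in range(steps)) * h

n, k = 10, 7  # hypothetical data: 7 heads in 10 flips
total = integrate(lambda t: posterior_density(t, n, k))
print(f"integral of the posterior over [0, 1]: {total:.6f}")
```

The integral should come out very close to 1, which confirms that the factorial prefactor is the right normalizing constant.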
Given this data, what should we believe the value of $\theta$, the probability that the coin gives heads, to be? We can actually give several answers to this question, depending on exactly what we wish to accomplish. If we want to maximize our probability of being right, we should choose the mode of the posterior $p(\theta \mid k, n)$. This is called the maximum a posteriori (MAP) estimator. In this model, it has the following value:
$$\hat{\theta}_{\mathrm{MAP}} = \frac{k}{n}$$
Because our prior distribution is uniform, the MAP estimator is also the maximum of the likelihood function $p(k \mid \theta, n)$, so it’s also the maximum likelihood (ML) estimator. If the prior were different, the MAP estimator and the ML estimator would differ in general.
But maybe we would want instead to minimize the squared error of our estimator, i.e., find a value that is more representative of the global shape of the posterior. This minimum mean-squared error (MMSE) estimator is always equal to the expected value of $\theta$ under the posterior distribution. In our model, it has the following value:
$$\hat{\theta}_{\mathrm{MMSE}} = \mathbb{E}[\theta \mid k, n] = \frac{k + 1}{n + 2}$$
The MMSE estimator, in this specific model, also corresponds to the predicted probability of getting heads on the $(n+1)$-th observation (which I leave as an exercise for the curious reader).
This concludes our section on the Bayesian method: we have designed two slightly different estimators for the value of $\theta$ when we observe some dataset with a given number of heads and tails. Note that assuming a different prior distribution for $\theta$ would give us slightly different expressions for the estimators.
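For readers who prefer to check the claim numerically before attempting the exercise, here is a small Python sketch. It assumes a uniform prior, $n$ flips and $k$ heads, and uses the standard identity $\int_0^1 \theta^a (1-\theta)^b \, d\theta = a!\,b!/(a+b+1)!$ to compute the predictive probability of heads exactly. All function names and the example data are illustrative assumptions.

```python
from fractions import Fraction
import math

def beta_integral(a, b):
    """Exact integral of theta**a * (1 - theta)**b over [0, 1], for integers a, b >= 0."""
    return Fraction(math.factorial(a) * math.factorial(b), math.factorial(a + b + 1))

def map_estimate(n, k):
    """Mode of the posterior under a uniform prior (also the ML estimate)."""
    return Fraction(k, n)

def mmse_estimate(n, k):
    """Posterior mean under a uniform prior."""
    return Fraction(k + 1, n + 2)

def predictive_heads(n, k):
    """P(next flip is heads | k heads in n flips): average of theta under the posterior."""
    # The binomial coefficient cancels between numerator and denominator.
    return beta_integral(k + 1, n - k) / beta_integral(k, n - k)

n, k = 10, 7  # hypothetical data: 7 heads in 10 flips
print("MAP :", map_estimate(n, k))
print("MMSE:", mmse_estimate(n, k))
print("P(heads on next flip):", predictive_heads(n, k))
```

Using `Fraction` keeps everything exact, so the equality between the MMSE estimate and the predictive probability can be compared directly rather than up to rounding.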
Assessing the frequentist properties of the estimators
Let’s now put on our frequentist hat, and check whether our estimators have good properties. Let’s first assume that our generative model is correct, and let’s note $\theta_0$ the true heads probability of the coin that generated the dataset we are looking at. We throw this coin $n$ times and obtain a number of heads $k$, and compute the estimators $\hat{\theta}_{\mathrm{MAP}}$ and $\hat{\theta}_{\mathrm{MMSE}}$. $k$ is distributed as a binomial distribution with parameters $(n, \theta_0)$. In particular, if we condition on a value of $\theta_0$, $k$ has mean $n\theta_0$ and variance $n\theta_0(1 - \theta_0)$.
We can translate these results to the MAP estimate: $\hat{\theta}_{\mathrm{MAP}}$ is always unbiased (i.e., $\mathbb{E}[\hat{\theta}_{\mathrm{MAP}} \mid \theta_0] = \theta_0$) and its variance, $\theta_0(1 - \theta_0)/n$, is never larger than $1/(4n)$. These two properties are important because they tell us that, when we compute $\hat{\theta}_{\mathrm{MAP}}$, we can expect the true value $\theta_0$ to be in a small neighborhood around $\hat{\theta}_{\mathrm{MAP}}$, i.e., we can construct a confidence interval.
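We can also compute the mean and variance for the MMSE estimator:
$$\mathbb{E}[\hat{\theta}_{\mathrm{MMSE}} \mid \theta_0] = \frac{n\theta_0 + 1}{n + 2}, \qquad \mathrm{Var}[\hat{\theta}_{\mathrm{MMSE}} \mid \theta_0] = \frac{n\theta_0(1 - \theta_0)}{(n + 2)^2}$$
from which we can see that the MMSE estimator is biased towards 0.5 and, because of the bias-variance trade-off, this enables it to have a smaller variance than the MAP estimate.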
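The bias and variance statements can be checked exactly for any given coin by summing over the binomial distribution of the number of heads. The sketch below does this in Python, assuming a true heads probability `theta0` and `n` flips. The names and the chosen values (`n = 20`, `theta0 = 0.8`) are illustrative assumptions.

```python
import math

def binomial_pmf(n, k, theta):
    """P(k heads in n independent flips) when each flip gives heads with probability theta."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

def mean_and_variance(estimator, n, theta0):
    """Exact mean and variance of estimator(n, k) when k ~ Binomial(n, theta0)."""
    probs = [binomial_pmf(n, k, theta0) for k in range(n + 1)]
    values = [estimator(n, k) for k in range(n + 1)]
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var

def map_estimate(n, k):
    return k / n

def mmse_estimate(n, k):
    return (k + 1) / (n + 2)

n, theta0 = 20, 0.8  # hypothetical coin and sample size
map_mean, map_var = mean_and_variance(map_estimate, n, theta0)
mmse_mean, mmse_var = mean_and_variance(mmse_estimate, n, theta0)

print(f"MAP : mean={map_mean:.4f}  variance={map_var:.5f}  (bound 1/(4n)={1 / (4 * n):.5f})")
print(f"MMSE: mean={mmse_mean:.4f}  variance={mmse_var:.5f}")
```

With these values the MAP mean should equal `theta0`, while the MMSE mean is pulled from 0.8 towards 0.5 and its variance is smaller than that of the MAP estimate.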
Another interesting analysis, which I won’t do here because it is a bit long and I couldn’t explain it properly, consists in analyzing the method when the data we are applying it to isn’t generated by our probabilistic model: for example, when the data has temporal correlation, as in a one-step Markov chain (which seems like a reasonable model for the coin-tossing procedure). If you are interested, it’s a great exercise. This problem is called model misspecification and is an important issue that needs to be addressed in Bayesian statistics.
There are essentially two big consequences of misspecification. First of all, the true model (e.g., the Markov chain) doesn’t have a parameter $\theta$ anymore, so our estimators can’t estimate it in any way. What they do estimate is the probability of heads in the equilibrium distribution of the Markov chain. Second, the variance of the estimators increases and can be much larger than the upper bound we derived before! This means that if we are not careful to apply our model only to truly independent data, then whatever confidence interval we derive from our theoretical analysis, which relies on this key independence assumption, can be completely off.
How could we assess how bad misspecification can be for a Bayesian method without this sort of frequentist approach? We can’t! This is why I really believe that the two methods should be considered together, and not on their own.
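To get a feel for the second consequence, here is a small simulation sketch in Python. It is not the analysis alluded to above, only an illustration under assumed settings: a symmetric two-state Markov chain in which each flip repeats the previous one with probability `stay_prob` (so `stay_prob = 0.5` gives independent fair flips, and the equilibrium probability of heads is 0.5 in both cases). The function names, the seed and all numeric values are hypothetical choices.

```python
import random

def simulate_flips(n, stay_prob, rng):
    """Simulate n flips (1 = heads) from a symmetric two-state Markov chain.

    Each flip repeats the previous one with probability stay_prob;
    stay_prob = 0.5 corresponds to independent fair flips.
    """
    flips = [rng.randint(0, 1)]  # start from the equilibrium distribution
    for _ in range(n - 1):
        prev = flips[-1]
        flips.append(prev if rng.random() < stay_prob else 1 - prev)
    return flips

def variance_of_map(n, stay_prob, trials, rng):
    """Empirical variance of the estimate k / n over many simulated datasets."""
    estimates = [sum(simulate_flips(n, stay_prob, rng)) / n for _ in range(trials)]
    mean = sum(estimates) / trials
    return sum((e - mean) ** 2 for e in estimates) / (trials - 1)

rng = random.Random(0)  # fixed seed so the run is reproducible
n, trials = 100, 2000
iid_bound = 1 / (4 * n)  # variance bound derived under independence
iid_var = variance_of_map(n, 0.5, trials, rng)
sticky_var = variance_of_map(n, 0.9, trials, rng)

print(f"bound under independence : {iid_bound:.5f}")
print(f"independent flips        : {iid_var:.5f}")
print(f"correlated flips (0.9)   : {sticky_var:.5f}")
```

In this setup the correlated chain should show a variance several times larger than the independence bound, even though the estimator still targets the equilibrium probability of heads.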
Summary and conclusion
Hopefully, you now see better the distinction I want to make between a Bayesian method, which deals with a specific way of constructing methods / algorithms to answer probability questions, and a frequentist analysis of a given method, which deals with checking if and under which assumptions on the data-generating process, the method works well. My personal opinion is that these are two complementary approaches and that it is short-sighted and damaging to ignore one half of the wonderful field of statistics.
Finally, note that the Bayesian method is one way to construct estimators and statistical methods. Another way is to start by specifying frequentist properties that we want our estimator to satisfy, and to then find an estimator that satisfies these conditions. It’s thus possible to work outside of the Bayesian paradigm.