Some remarks on shrinking and the Stein phenomenon

I’ve been reading about the bias-variance trade-off and, most importantly, about the Stein phenomenon. Here are some of my thoughts, which I hope can help others with this slightly thorny subject.


First, let us set the stage. We want to infer the value of some d-dimensional parameter \theta. To do so, we are given a single observation X, which corresponds to \theta corrupted by Gaussian noise \eta whose covariance is the identity matrix:

\displaystyle{ X = \theta + \eta }

How should we estimate \theta from X ?


A natural idea is to use X itself. This is the maximum-likelihood estimator of \theta and is indeed a good estimator of \theta. It is the best estimator that is translation-invariant; it is minimax; etc.

However, Stein has shown that X is not actually perfect: there exists a whole family of estimators which are better than it, if d \geq 3. These are estimators of the form:

\displaystyle{ \hat{\theta} = \left(1 - \frac{d-2}{\|X -\theta_0 \|^2} \right) (X-\theta_0) +\theta_0}

No matter the true value of \theta, these estimators always have lower Mean-Squared Error than X. In other words, they always do a better job on average! In a sense, it is thus slightly “stupid” to use X instead of them, because you are going to make bigger errors on average by doing so.
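To make this concrete, here is a small Monte Carlo sketch of the comparison. The dimension d = 10, the true value of \theta, the number of trials and the function names are all arbitrary choices of mine for illustration; the shrinkage factor uses the squared norm of X - \theta_0.

```python
import numpy as np

def james_stein(x, theta0):
    """Shrink each row of x towards theta0 with the factor 1 - (d-2)/||x - theta0||^2."""
    d = x.shape[-1]
    diff = x - theta0
    sq_norm = np.sum(diff**2, axis=-1, keepdims=True)
    return theta0 + (1.0 - (d - 2) / sq_norm) * diff

def mse(estimates, theta):
    """Monte Carlo estimate of E||estimate - theta||^2."""
    return np.mean(np.sum((estimates - theta) ** 2, axis=-1))

rng = np.random.default_rng(0)
d, n_trials = 10, 200_000
theta0 = np.zeros(d)          # point we shrink towards
theta = np.full(d, 0.5)       # true parameter (arbitrary, for illustration)

# One observation X = theta + eta per trial, with eta ~ N(0, I_d).
x = theta + rng.standard_normal((n_trials, d))

mse_mle = mse(x, theta)                       # should be close to d
mse_js = mse(james_stein(x, theta0), theta)   # should be smaller
print(f"MSE of X: {mse_mle:.3f}, MSE of Stein-estimator: {mse_js:.3f}")
```

Changing theta (even to something very far from theta0) should keep the second number below the first, although the gap shrinks as theta moves away.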


The reason this occurs is the strong power of biasing your estimators in high dimensions. Biasing adds a bias term to the error, but it causes a stronger reduction in the variance, and is thus beneficial overall. The Stein-estimator creates a bias towards \theta_0 which improves the performance of the estimation. Other biased estimators, such as those obtained by adding an L_1 or L_2 penalty (in machine-learning’s horrible jargon, these are known as Lasso and Ridge penalties), would also improve over X when \theta is close enough to \theta_0, but the plain estimator X is better than them when we are far away. The Stein-estimator, however, is always better since its bias vanishes when X is far from \theta_0.
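The contrast with a fixed penalty can be checked numerically. The sketch below compares X, an L_2-penalized estimator with a fixed penalty weight, and the Stein-estimator at several distances between \theta and \theta_0. The penalty weight lam = 1, the distances and the function names are assumptions of mine, chosen only to illustrate the point.

```python
import numpy as np

def james_stein(x, theta0):
    """Shrink each row of x towards theta0 with the factor 1 - (d-2)/||x - theta0||^2."""
    d = x.shape[-1]
    diff = x - theta0
    sq_norm = np.sum(diff**2, axis=-1, keepdims=True)
    return theta0 + (1.0 - (d - 2) / sq_norm) * diff

def ridge_shrink(x, theta0, lam):
    """Minimiser of ||x - t||^2 + lam * ||t - theta0||^2: a fixed shrinkage factor."""
    return theta0 + (x - theta0) / (1.0 + lam)

def mse(estimates, theta):
    """Monte Carlo estimate of E||estimate - theta||^2."""
    return np.mean(np.sum((estimates - theta) ** 2, axis=-1))

rng = np.random.default_rng(1)
d, n_trials, lam = 10, 100_000, 1.0
theta0 = np.zeros(d)

results = {}
for dist in (0.0, 2.0, 10.0):
    theta = np.zeros(d)
    theta[0] = dist                      # true value at distance `dist` from theta0
    x = theta + rng.standard_normal((n_trials, d))
    results[dist] = (
        mse(x, theta),
        mse(ridge_shrink(x, theta0, lam), theta),
        mse(james_stein(x, theta0), theta),
    )
    print(dist, ["%.2f" % r for r in results[dist]])
```

With these settings the fixed penalty has risk d/4 + \|\theta - \theta_0\|^2/4, so it wins when \theta is close to \theta_0 and loses badly to X at distance 10, whereas the Stein-estimator should stay below X in every row.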


When I first saw this result, I was very perplexed by the peculiar form of the Stein-estimator. There is this tendency in math to present properties like this one as magical: “here is this one guy that just happens to have a crazy property”, instead of detailing where the property comes from. The Stein-estimator is usually presented in this manner, but it does have an interesting origin. Indeed, it is actually close to a Bayesian estimator (more precisely, it can be derived from an empirical Bayes approach; since it isn’t truly Bayesian, this suggests that there exists another estimator that dominates it). This makes a little bit of sense, since Bayesian estimators are known to have good Mean-Squared Error. Of course, I’m still very curious whether there could be other biases which result in estimators that dominate X everywhere. I’m just never satisfied with a single counter-example: I want to know the set of all counter-examples.
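For the curious, here is a sketch of the empirical Bayes argument. The prior variance \tau^2 is a symbol I introduce only for this derivation. Suppose that \theta has a Gaussian prior centered at \theta_0:

\displaystyle{ \theta \sim N(\theta_0, \tau^2 I_d) }

Then the posterior mean of \theta given X shrinks X towards \theta_0 by a fixed factor:

\displaystyle{ E[\theta | X] = \theta_0 + \left(1 - \frac{1}{1+\tau^2} \right) (X - \theta_0) }

We do not know \tau^2, but we can estimate the factor from the data. Marginally, X - \theta_0 is Gaussian with covariance (1+\tau^2) I_d, so \|X - \theta_0\|^2 / (1+\tau^2) follows a chi-squared distribution with d degrees of freedom. Since the inverse of such a variable has mean 1/(d-2) when d \geq 3, we get:

\displaystyle{ E\left[ \frac{d-2}{\|X - \theta_0\|^2} \right] = \frac{1}{1+\tau^2} }

Replacing the unknown 1/(1+\tau^2) in the posterior mean by this unbiased estimate gives the Stein-estimator.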


Thus, the Stein phenomenon gives us the high-level lesson that “Bias is good (in high-dimensions)”. However, there remains one question: in which direction should we bias our estimate? This is an interesting question.

Indeed, if we have no idea which \theta_0 to use, one idea is to choose it randomly around X, which constitutes a natural first guess for \theta. However, that is stupid, by the following argument:

  • Biasing our estimator towards \theta_0 is only good compared to X if \theta_0 is pulling us towards the true value \theta.
  • If we choose randomly around X, half of the values that we choose are going to be further away from \theta than X.
  • Thus, randomly choosing our bias direction is a bad idea: it’s going to be worse than X (see: for a more detailed presentation of that idea).
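This argument can also be checked with a quick simulation. The sketch below draws \theta_0 from a Gaussian centered on X and then applies the Stein-estimator with that center. The sampling scheme, its scale (spread = 3) and the other constants are my own assumptions, picked just to illustrate the point.

```python
import numpy as np

def james_stein(x, theta0):
    """Shrink each row of x towards the matching row of theta0."""
    d = x.shape[-1]
    diff = x - theta0
    sq_norm = np.sum(diff**2, axis=-1, keepdims=True)
    return theta0 + (1.0 - (d - 2) / sq_norm) * diff

def mse(estimates, theta):
    """Monte Carlo estimate of E||estimate - theta||^2."""
    return np.mean(np.sum((estimates - theta) ** 2, axis=-1))

rng = np.random.default_rng(2)
d, n_trials, spread = 10, 200_000, 3.0
theta = np.full(d, 0.5)                  # true parameter (arbitrary)
x = theta + rng.standard_normal((n_trials, d))

# Hypothetical scheme: pick theta0 at random around the observation itself.
theta0 = x + spread * rng.standard_normal((n_trials, d))

mse_mle = mse(x, theta)
mse_random = mse(james_stein(x, theta0), theta)
print(f"MSE of X: {mse_mle:.3f}, MSE with random theta_0: {mse_random:.3f}")
```

With this particular scheme, the shrinkage step works out to adding to X a random perturbation that is independent of the noise, so it can only add error: the second number should come out larger than the first.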

Thus, biasing is indeed good if we have true prior information about the direction in which we should bias our guess. If we don’t, then it is better to use the unbiased estimator X since a randomly chosen bias is unlikely to actually achieve a reduction in error.