# Happy new year!

It’s now 2018: a brand new year starts! Like most people, I’d like to take some time to think back on 2017 and to decide on new year’s resolutions (which I’ll try not to forget too quickly).

My greatest success in 2017 was that I managed to turn around my teaching: in 2016, I taught for the first time (an introductory course covering probability and statistics for engineers) and it didn’t go great, as I have talked about in earlier posts. This year, the course was very well received by my students, and I felt great about my lecturing. I hope I can keep up the good work and have my lecturing be as good next year!

However, since I’m also very critical, I’m still not completely satisfied with my teaching material: this is something I would like to improve for next year (and I’m looking forward to the summer, during which I will actually have the time to work on it). My biggest worry is that the current material is a bit dry: it involves very little doing or discovering by the students. As a consequence, I’m concerned about students feeling that this is too scholarly and not relevant to real life. I’m thus going to try to figure out how to get them to practice more next year, probably by giving out programming exercises.

In 2017, I also had some great ideas for my research (which I’ll try to write about in future posts). More and more, I’m drawn towards deep learning (exactly like a moth to a flame: let’s hope for my sake that this story ends up better for me): I’m looking forward to trying to get some real research done on this fascinating topic. I also hope to be able to dive into the exciting world of Functional Data Analysis, in which you wonder how statistics would proceed if our observations were random functions, $t \rightarrow X(t)$, instead of vectors. I’m still not completely sure how I feel about that: I oscillate between the intuition that this is a better approach and the reality that it leads to very counter-intuitive properties.

Finally, 2017 was the year I first supervised a student on a long-term project: I supervised a Master’s student at EPFL on his final project. I’m amazed at how great this turned out. My student was very sharp and worked super hard. I’m super happy about the results he got, and we’ll be submitting them to NIPS next year (how many master’s projects end up there?). I’m just a little bit sad because he’s leaving academia for a more normal job involving statistics. I’m sure he’ll do great work there, and I wish him the best on his way.

I look forward to 2018. I’m confident that this will be a great year for me: I’ve got tons of projects for my professional life, both for research and for teaching. My only worry is whether I’ll have the time to do all of them! This leads me to my only new year’s resolution: trying to be more efficient with my time!

Happy new year! I wish you all the best.

# In praise of intuition

There are two ways to study mathematics. One is based on emphasizing rigor, and I’ll abusively refer to this as the mathematician’s approach. The other one, possibly less well known or at least less well received, is based more on intuition. Abusively, I’ll refer to this as the physicist’s perspective. With this blog post, I aim to explain why you should try to incorporate more intuition into your mathematics.

At first, it might seem impossible for intuition to be a useful skill in mathematics. Indeed, if there is one idea that is central to mathematics, it is the idea of rigor. One perspective on math is that it is a game of manipulating initial truths (postulates) through the rules of logic in order to yield new truths. The key goal of this game is that every manipulation that we do be perfectly rigorous, which is why we willfully bind ourselves to the very constraining rules of logic and to very precise definitions, constructions, etc.

Intuition is the exact opposite of rigor. Rigor carefully and exhaustively travels from point A to point B and makes sure that no stone is left unturned. Intuition instead makes far-reaching leaps based on nothing but hot air and bravado. It would then seem extremely counter-productive to try to use intuition in mathematics, which is essentially rigor embodied.

However, thankfully, mathematics is not only about rigor. While mathematics is focused on only making statements that are true, it’s also focused on making statements that are interesting. Indeed, consider the following two statements:

• $\displaystyle{ x^2 = 4 \text{ is solved by }x=\pm 2 }$
• $\text{Any equation of the form }a x^2 + bx + c = 0 \text{ (with } a \neq 0\text{) has two solutions in } \mathbb{C}\text{, counted with multiplicity (you know the formula so I won't write it)}$

Both are true. However, the first one is trivial (not completely so: checking that these two solutions are correct is easy; proving that they exhaust the solutions is more difficult), while the second one isn’t. In fact, the second statement contains infinitely many trivial statements of the first kind. It thus constitutes a much more interesting true statement than the first one!

We then need to consider what makes a true statement “more interesting” or “less interesting”. One key factor is the generality of the statement. If one statement is a particular case of another one, then the second statement is clearly “more interesting”.

This is where intuition comes into play. Intuition enables us to identify patterns. With intuition, we can observe that statements a1, a2, a3 are true, and then conjecture that a general statement A, for which a1, a2, a3 are particular cases, might also be true. However, we don’t stop there, since mathematics is a game of rigor. We return to full mathematical rigor and try to prove that our conjecture is indeed true.

When I was a kid, I loved labyrinths: I had a few books full of them which I loved to go back to. One lesson I learned on them actually nicely illustrates my point: I quickly noticed that it is often useful to actually try to go through the labyrinth the opposite way, starting from the end and returning to the beginning. This is precisely the role that intuition can play in mathematics: it enables us to start from conjectures and to backtrack from them to statements that we know to be true.

In practice, I find it even better to mix the two approaches together. I’ll often start from an intuited conjecture and try to prove it. In trying to prove it, I improve my understanding of the problem and identify problems with the initial conjecture. This means I can sharpen my intuitions in order to build better conjectures, which I then try to prove, etc.

# Writing a punchy paper

I’m currently preparing a paper for the (very great) “Advances in Approximate Bayesian Inference” NIPS workshop (here is their website). The constraints are pretty tight, as the paper can only be 4 pages long (we can provide supplementary information in excess of that, thankfully).

I’ve thus spent this morning trying to make that paper punchy, which gives me a good excuse to detail (and thus think about) my writing process. Let’s dive into it.

### High-level description

Before I start writing the paper, I like to draw up a simple list of high-level questions that gives a synthetic description of the paper.

• What is the context of the article?
• What is the precise issue that I’m addressing?
• What does the reader learn from my paper?
• What background information does he need to understand my paper? (Bonus question: will most readers have that background information? Do I need to expand my paper to reduce the requirements?)
• Why does the reader care? (The most important question! Even if you know your paper isn’t groundbreaking, give your readers a reason to care.)

Most often, I also think of how I would describe the paper in two or three sentences to a colleague. Conveniently, this can be done by stringing together the answers to all the “big questions” which I just presented. For my article, we get:

> In order to do Bayesian machine learning, we most often use approximations, but we don’t have results quantifying whether they are good or not. This article is a theoretical contribution giving a computable measure of the quality of one approximation.
>
> The result refers to slightly exotic measures of distance between probability distributions, but nothing too crazy.
>
> This gives a powerful tool for checking whether our approximations are good or not for little extra cost.

This casual abstract of the paper is valuable because it makes explicit all of the important points of the paper. Hopefully, this will prevent us from forgetting them!

### Structuring the paper

The next thing I do is organize the paper.

I always start with an introduction. Its objective is to tell the reader why he should care about reading the rest of the paper. We need to recall the global and local context of the problem we address, and advocate why it is an interesting one. We also need to tell him what lessons we can learn from studying it. It is fine for the introduction to tease the results that we are going to derive.

I usually adopt the following structure:

• Global context (one paragraph): what is the broader setting of our work?
I like to make this paragraph interesting even for scientists in closely related domains (for example, if my article is in Bayesian statistics, I’d want frequentist statisticians to still be interested in it, so I’ll make sure my introduction caters to them).
It is perfect if the first sentence of the introduction is a “hook”: a provocative / thought-provoking summary of the global context. However, this is the ideal case; we do not always get there.
• Local context (one-two paragraphs): what precise problem are we tackling?
This is usually where I do my literature review: we recall earlier solutions proposed in the community and explain how we can improve over them.
• Our contribution (one paragraph): what does the article bring?
In this paragraph, I give a quick recap of the solutions that the article provides. I want the reader to be interested in what I’m writing, and this seems like a good way to do it: we are telling him what he will gain from reading our article.
• Structure of the article (one paragraph): I finish my introduction by detailing how the article is organized.

The next step is organizing the body of the article. This is heavily dependent on what the article is about, but I’m usually looking for the following features:

• Causality: my reader shouldn’t have to jump around my article when reading it. This means that I’ll start with the simplest points, and then continue with the more complicated ones. If I need to provide background information that the reader might be lacking, I’ll do so at the beginning of the article, etc.
• Flow: as much as possible, ideas presented in the article should (appear to) flow naturally from one to another.
I don’t want my reader to get stuck on a rough transition, and I want him to understand why a section is finished and why we are moving to the next one.
• Transparency: my goal is always for the reader to understand what I’m doing. I thus go overboard with stating my goals within sections, recapping what we’ve learned, etc.
While the structure is obvious to us, the writers, it might not be so to our reader. My greatest fear is my reader getting lost in my paper.
• Ease of reading: an over-arching principle of my writing is that I want to make it easy for my reader. If I have a decision to make on my paper, my focus is always on what makes it the easiest to read.

These are general objectives. Since I mostly write about statistical theory, I also have a few ideas which I think are specific to that theme:

• Intuitions are better than proofs: my objective is to write for most readers so that my paper speaks to as large an audience as possible. Most readers won’t dive into proofs, or won’t gain a lot by reading them (even very technically competent readers), because proofs are boooooorrrinnng.
In my theoretical papers, my objective is thus to give intuitive derivations of my results. These intuitive derivations aren’t fully rigorous proofs, but they are “lighter” than one. As such, they are much easier to digest for my readers, who might commit them to memory more easily.
I delegate fully rigorous proofs to supplementary information.
• It’s better to have too much background: if my proof relies on some background information, I’d rather present it in my article than rely on my reader knowing it.
Indeed, if the reader already knows it, he might still benefit from a refresher / different perspective on the topic. At worst, he’ll be bored but, if I’ve done a good job describing the structure of my paper, he’ll feel free to skip the sections covering what he already knows, and nothing will be lost.
If the reader doesn’t know it, then it would be catastrophic not to present the background, as it means that he would either not understand my paper, or he’d need to take a break from it (and thus risk never coming back).
Thus, adding more background has little cost but heavy benefits.
• Be punchy instead of being fully general: this is related to my first point about intuitions, but I want to focus here more on the statement of our theorems. In many cases, a theoretical statement can be made in many different variants. For example, we might be able to weaken our assumptions to obtain a more general theorem, etc.
There is a mistake here that I wish to avoid: focusing too much on generality. My worry here is confusing my reader by stating a theorem that is too general, too quickly. I’d rather ease him into the fully general result by starting from a more intuitive and easier-to-understand theorem.
This is counter-intuitive because, as mathematicians, we are always told that generality is better. This isn’t the case: generality can also be harder to understand, because there are more “moving pieces” in the result. Instead, by going from simple to general, we can help the reader focus on the most important parts of the result first, so that he doesn’t feel lost when facing the general theorem.

Finally, there are the concluding sections, in which I include the discussion section and the conclusion proper.

My objective here is to make the contribution of the article clear and memorable, and to highlight interesting potential follow-ups. Since I focus on theory, this is a good place to give an example of an application of the theorem presented in the article.

### Writing!

This section is pretty straightforward: you sit down in front of your computer, get into the zone and WRITE!

Here is some useful but minor advice that works well for me:

• Avoiding the “blank page syndrome”: it’s hard to get started. Usually, my problem is that all my ideas just seem really bad.
My trick is to just power through it: I sit down and start writing whatever ideas pop into my mind for the introduction. As I’m writing, I might feel like my output is really bad, but I ignore those feelings. I’ll then come back to the section later to improve it.
• Write – discard – repeat: when I decide to rewrite a section (which occurs a lot because my first drafts tend to be poor), my approach is usually to re-read the section, then copy-paste it somewhere safe, and start from scratch. This way, I don’t get lost in the details of modifying a document, which I find to be exhausting. What I have instead is a blank slate, and some ideas of what went well and what went poorly in the preceding version. I find it much easier to write this way.
• Write it out, then edit: I usually never start editing a section before I finish my first draft of the whole document. I find that you have a much clearer picture of the document once you’ve written it out (at least) once. Thus, I refrain from trying to improve the sections before I have that clearer picture.
• Write quickly and edit many times: for me, it works better to rewrite a section quickly many times than to think about it for hours, carefully and slowly writing down a masterpiece by weighing each word. I’d rather go quickly many times than carefully once. That’s just how it works for me: find what works for you.
• Get feedback: try to gather as much feedback from your colleagues as you can. This can mean giving them a draft of the article to read, if they’re very nice, but it can also simply be discussing the structure of the article, or the “punchiness” of an argument, explanation, etc.

Good luck with writing! In the end, it is just another skill that you need to practice to get good at it. I hope that this can help you get there faster.

# Understanding AIC and BIC

I’ve spent the better part of the day trying to understand AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion; note that this is a misleading name: BIC has nothing to do with information theory). I’m quite happy to have (at least partially) understood both. Let me try to explain them to you.

First of all, let’s understand what we are talking about. We have some random variable $X$ which we are trying to model with various models $M_k$. These models are parametric probabilistic models: they specify a function $p(X|\theta_k, M_k)$ (note that, critically, the dimension of $\theta_k$ varies between models). AIC and BIC both deal with ideas of how to choose an appropriate model $M_k$. For a simple example, consider modeling $X$ as a Gaussian ($k=1$), or a mixture of two Gaussians ($k=2$), or of three, etc.

If we had access to the exact probability distribution of $X$, one thing we could do to compare these various models $M_k$ is the following:

• First, find the value of the parameter $\theta_k$ such that the probabilities of $p(X|\theta_k,M_k)$ are the closest to the truth.
• Second, report the distance between the truth and the best approximation in model $M_k$.
• Third, use this to rank the various models.

One sensible notion of distance we could use is the KL divergence:

$\displaystyle{ KL(p(X), q(X)) = \int p(x) \log{\left(\frac{p(x)}{q(x)}\right)} dx}$

which has the benefit of having nice computational properties.
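As a quick numerical sanity check (the two Gaussians and the sample size here are toy choices of mine), we can estimate this KL divergence by Monte Carlo, using the fact that it is an expectation under $p$, and compare against the known closed form for Gaussians:

```python
import numpy as np

# Toy check: p = N(0, 1), q = N(1, 1.5^2). The KL divergence is the
# expectation of log(p(x)/q(x)) under p, so IID samples from p give a
# Monte Carlo estimate of it.
rng = np.random.default_rng(0)
mu_p, s_p = 0.0, 1.0
mu_q, s_q = 1.0, 1.5

x = rng.normal(mu_p, s_p, size=200_000)
log_p = -0.5 * ((x - mu_p) / s_p) ** 2 - np.log(s_p * np.sqrt(2 * np.pi))
log_q = -0.5 * ((x - mu_q) / s_q) ** 2 - np.log(s_q * np.sqrt(2 * np.pi))
kl_mc = np.mean(log_p - log_q)

# Closed form for the KL divergence between two univariate Gaussians
kl_exact = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q) ** 2) / (2 * s_q**2) - 0.5

print(kl_mc, kl_exact)  # the two values agree to about two decimal places
```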

In practice, we won’t be able to compute this KL divergence because we do not have access to the true probability distribution of $X$. What we have most often access to is IID samples from $X$. What we can then do is use these samples to construct an unbiased estimate of the KL divergence between the truth and the best probability distribution inside model $M_k$. This is precisely what the AIC offers.

More precisely, we do not estimate the KL divergence itself. We estimate the expected log-likelihood of the best probability distribution inside model $M_k$. This quantity is equal to the KL divergence up to one unknown common constant: the entropy of the true distribution. Thus, a ranking of models based on expected log-likelihoods has the same order as one based on KL divergences. However, we can’t simply use the log-likelihood at the MLE $\theta_k^*$ because that value is biased. What Akaike did was compute the asymptotic bias, and the AIC gives a corrected value which removes it. This correction depends on the dimensionality of the parameter $\theta_k$ and on the number of datapoints in the dataset. Note that other, more advanced criteria also exist; the only one worth mentioning is the AICc, which gives a slightly improved correction for small datasets.
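As an illustration of how the criterion is used in practice (a toy setup of my own), here is AIC $= 2k - 2\log\hat{L}$ computed for two Gaussian models of the same data, where the model with a freely fitted variance should win:

```python
import numpy as np

# Toy illustration: data drawn from N(0, 2^2), compared under two models.
# AIC = 2k - 2 * (maximized log-likelihood); lower is better.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=500)

def gauss_loglik(x, mu, var):
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))

# Model 1: unknown mean, variance fixed to 1 (k = 1 parameter).
aic1 = 2 * 1 - 2 * gauss_loglik(x, x.mean(), 1.0)

# Model 2: unknown mean and variance (k = 2 parameters, MLE variance).
aic2 = 2 * 2 - 2 * gauss_loglik(x, x.mean(), x.var())

print(aic1, aic2)  # aic2 < aic1: the extra parameter is worth its cost here
```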

Thus, AIC is about correctly estimating the quantity we should care about when choosing an appropriate model. AIC doesn’t deal directly with choosing the model; however, we can use the unbiased estimates to then select a model which gives a good account of the data. The AIC idea can also be applied to estimating the minimum of other objectives inside a parametric class, unrelated to likelihoods.

BIC deals with a slightly different problem. Assume that we have a very large (or maybe infinite) number of nested models: model $M_{k+1}$ is a more complex version of model $M_k$. For example, consider performing linear regression while expanding the set of predictor variables, or the example I already gave of a mixture of $k$ Gaussians. In general, the true probability distribution won’t fall inside our class of models. Thus, model $M_{k+1}$ will always be a better model than model $M_k$, because its increased flexibility allows it to capture the complexity of the data better. In such situations, the BIC is inappropriate to use.

However, in some extremely rare examples, it might happen that the true probability distribution is actually inside model $M_{k^*}$ (and thus also inside all further models). We could then try to recover this $k^*$. For example, for linear regression, the true model might be a quadratic polynomial. Thus, trying to fit a third degree polynomial just provides extra degrees of freedom which are not needed. We might then try to learn from the data that a second degree polynomial is sufficient as that would be informative.

BIC focuses on this task of consistently estimating $k^*$. BIC also takes the form of a correction to the log-likelihood that depends on the number of datapoints and the dimensionality of the model. However, BIC doesn’t aim to correct the bias that is present in that quantity. Its aim is that we can recover $k^*$ by finding the model with minimum BIC. This gives us a consistent estimator of $k^*$: as the number of datapoints $n$ grows, we recover the correct value with probability tending to 1.

However, this makes BIC extremely restricted: we can only use it if we assume that somehow we have captured the truth inside one of our models $M_k$. This is a very bold assumption, and it is 100% wrong, unless you have generated the data yourself. Thankfully, BIC can also be applied to a slightly more realistic case. This is the case in which the model chain is such that, after a certain $k^*$, the models stop improving: model $M_{k^*+1}$ is exactly as good as model $M_{k^*}$. This can only happen if, for some reason, the extra flexibility is not needed, even though model $M_{k^*}$ is not the true model. That could happen. For example, let us return to the regression model. Imagine if the true model is indeed quadratic, but the noise model you are using is incorrect. Then, all models beyond quadratic won’t give an improvement. Let us refer to this case as $k^*$ being the index of the quasi-true model. Thankfully for its use, BIC also correctly recovers a quasi-true model (in that it gives us a consistent estimator for it).
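The polynomial regression example can be sketched as follows (a synthetic setup of my own): the truth is quadratic, and minimizing BIC $= k \log n - 2 \log \hat{L}$ over polynomial degrees should recover $k^* = 2$:

```python
import numpy as np

# Synthetic example: quadratic truth plus Gaussian noise. We fit
# polynomials of increasing degree and compute BIC for each.
rng = np.random.default_rng(2)
n = 500
t = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * t - 1.5 * t**2 + rng.normal(0.0, 1.0, size=n)

def bic(degree):
    coeffs = np.polyfit(t, y, degree)
    resid = y - np.polyval(coeffs, t)
    sigma2 = np.mean(resid**2)  # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = degree + 2  # polynomial coefficients + the noise variance
    return k * np.log(n) - 2.0 * loglik

best = min(range(1, 6), key=bic)
print(best)  # BIC's chosen degree; it should identify the quadratic model
```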

I won’t go into the details but, as the name indicates, BIC is an approximation to a Bayesian idea. More precisely, BIC is a very rough approximation of the log-posterior distribution over models when the prior is uniform over all models AND when the prior over the parameters inside each model is also flat. Honestly, as an alternative to BIC I would thus use a more Bayesian method with:

• a realistic prior on the models which ranks more complicated models as less likely;
• a realistic prior on the parameter space;
• better approximations than the horrible ones that are used in BIC.

This would have the exact same guarantees as BIC (the asymptotic behavior would be identical) while being more principled (at least, appearing more principled to me) before the asymptote.

Thus, it turns out that AIC and BIC are actually slightly different beasts. AIC is all about estimating a “fitting score” for each model in an unbiased fashion. It is thus extremely general. We can then use the unbiased score to decide between the various models at hand, or construct confidence intervals, etc. In contrast, BIC can only be used if, for some reason, we suspect that we are in a situation in which one model $M_{k^*}$ is true or quasi-true. Then, we can use BIC to recover $k^*$. This makes BIC way less useful.

# Noise in the publishing system

Today, I’d like to rant on what I perceive to be flaws of the current system of publishing (or, more accurately, on the flaws of how people treat our current system). Of course, like all rants from young people, please take it with a large pinch of salt: I know I know nothing, but I’d just like to be able to vent.

What annoys me is the fact that so many publications are either low-quality (the same work could be presented in a much clearer fashion) or low-effort (the work represents a marginal improvement over the existing state of the art). A small note: I’m more than fine with incremental work: it is an essential stepping stone in science. Most everything we do is definitely not a breakthrough. However, what is extremely annoying is when the authors aren’t straightforward about how their work is incremental. Some results are presented as if the authors are offering a revolutionary approach, even though it’s just the same old crap that they are re-hashing for the third time.

These two flaws make it so that reading articles is extremely unenjoyable and much harder than it has to be: when I’m reading, I want to absorb new knowledge. I really don’t want to fight against the authors to decode whatever they meant, and I really, really don’t want to have to remain hyper-attentive to decipher which parts are new and which are old stuff that I already know (and that the authors are probably butchering in their attempts at obfuscation).

I don’t know where these flaws come from or how to fix them. I’m guessing that part of the problem is that researchers are under so much pressure to produce new articles in order to secure funding/positions/etc. As a result, they need to cut corners, which explains the rushed articles and why they try to make their contribution sound more impressive than it is (so that their article gets accepted).

What I can do is strive to ensure that my own contributions don’t have these flaws (oh, the arrogance of youth). I’ll try as much as possible to have my contributions be as clear as I can make them (and I’ll take the time to ensure that this happens: I won’t rush to get something out if it isn’t ready). And, when I do some incremental work, I’ll make sure that I properly document exactly how it is positioned compared to the literature AND I’ll use such occasions to try to clarify the existing literature. I’ll do so by treating the corresponding article as a tutorial, with the objective that readers who aren’t familiar with the field wouldn’t need to refer to other works to understand the state of the art.

Hopefully, I can follow through on this ideal.

# Practical statistics basics: center and scale your predictors

Today, I want to discuss something that seems extremely small but is critical in “supervised” problems in which you are trying to predict some data $Y$ from some other data $X$. In a nutshell, you should always make sure that your predictors are centered (their center is 0) and scaled (their width is 1). Let’s dive into the details!

First, let me present the general “supervised learning” setting. We are given some number of pairs of examples $(X,Y)$. What we want to accomplish is to learn a function such that $Y \approx f(X)$. In general, we focus on trying to learn a linear function $f(X) = \sum \theta_i X_i$ but more general forms for the function are also possible. The usual approach is to write down a probabilistic model of the $Y$ conditional on the $X$ and $\theta$ and to maximize the likelihood to find the best values for $\theta$. If we want to be fancy, we can also add a regularizing penalty such as the $L_2$ one: $\sum \theta_i^2$.
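As a minimal sketch of this setup (all numbers here are illustrative choices of mine), penalized maximum likelihood with Gaussian noise and the $L_2$ penalty reduces to ridge regression, which has a closed-form solution:

```python
import numpy as np

# Illustrative data: linear truth plus a little Gaussian noise.
rng = np.random.default_rng(3)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(0.0, 0.1, size=n)

# Maximizing the penalized Gaussian likelihood amounts to minimizing
# ||y - X theta||^2 + lam * ||theta||^2, solved here in closed form.
lam = 1.0
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_hat)  # close to theta_true for this mild noise and penalty
```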

This is all straightforward statistics. However, there is one key step that many sources forget to mention even though it is critical. Quite simply, we need to do a little bit of processing on all the components of our predictor $X$: we must ensure that the various components $X_i$ are all (approximately) centered and scaled.

This can be justified in a variety of ways, but the one that makes the most sense to me is that we should try to make our methods as invariant to incidental details of the input as they can be, unless we have a very good reason not to. In this case, it is trivial to imagine situations in which the predictors get shifted around for some reason. For example, choosing different units for a measurement changes the scale of the predictors. It’s rarer for the center of the distribution to change, but that can sometimes happen. None of these modifications should change the result of our inference. Thus, our methods should include a step that removes these extra degrees of freedom and ensures that our inference is invariant.

Furthermore, consider that what we are trying to do is gain information from the $X_i$. When a value $x_i$ is close to the center of the $X_i$ values, it is a normal value of $X_i$ that provides us with no particular information; thus, it shouldn’t change our evaluation of $Y$. This intuition only holds if the center of $X_i$ is 0. Similarly, in order to know how relevant it is that $x_i$ differs from its center, we need to know the scale at which $X_i$ varies. If the value we are considering is close to 0 at the relevant scale for $X_i$, then again it should have a low impact on the value we predict for $Y$. Centering and scaling the predictors thus ensures that we treat the information we gain from all of them equally.

Now comes the thorny question: how exactly should we center and scale the predictors? Indeed, there are infinitely many notions of the center and scale of a random variable: should we center with the (empirical) mean of the $x_i$, or should we prefer the median? Should we scale using the square root of the variance? Or the $L_1$ deviation, $\min_{m} E(|X_i - m|)$? Or the interquartile range? I do not know the appropriate answer to these questions (and honestly, I’m not even sure there is a single appropriate answer). My instinct is to use a robust (key instinct: always be robust) measure of the width, so the $L_1$ deviation sounds like a fine choice to me, but the variance is probably fine too.
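In code, the two candidate recipes look like this (whichever estimators of center and width you settle on, the shape of the operation is the same):

```python
import numpy as np

# Columns of X are predictors; we standardize each column.
rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 2))

# Classical choice: empirical mean and standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Robust choice: median for the center, mean absolute deviation from
# the median (the empirical version of the L1 deviation) for the scale.
med = np.median(X, axis=0)
l1_scale = np.mean(np.abs(X - med), axis=0)
X_robust = (X - med) / l1_scale

print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and ~1 by construction
```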

As a final note, let me talk about one very cool method that people have been using in deep learning to make gradient descent work better: batch normalization. During batch training, we do not compute the output of the deep network as usual. Instead, we compute the activity of each layer one by one and, before feeding its output to the next layer, we make sure that the output of each unit in the layer is centered and scaled (using the empirical mean and variance inside the training batch under consideration). This little trick really improves the speed at which the network learns (I’m not sure anybody has a good intuition as to why; I certainly don’t).
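A sketch of that normalization step (simplified: the original method also learns a per-unit scale and shift, $\gamma$ and $\beta$, which I omit here):

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Center and scale each unit's output across the current batch."""
    mu = h.mean(axis=0)   # per-unit mean over the batch
    var = h.var(axis=0)   # per-unit variance over the batch
    return (h - mu) / np.sqrt(var + eps)

# A fake layer output: batch of 64 examples, 10 units.
rng = np.random.default_rng(5)
h = rng.normal(loc=2.0, scale=4.0, size=(64, 10))
h_bn = batch_norm(h)  # this is what gets fed to the next layer
print(h_bn.mean(axis=0).round(6))  # each unit is now centered near 0
```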

As a conclusion, please always remember to center and scale your predictors when doing supervised learning!

# Some remarks on shrinking and the Stein phenomenon

I’ve been reading on the bias-variance trade-off and, most importantly, on the Stein phenomenon. Here are some of my thoughts on the subject which I hope can help others with this slightly thorny subject.

First, let us set the stage. We want to infer the value of some $d$-dimensional parameter $\theta$. In order to do so, we are given a single observation $X$ which corresponds to $\theta$ corrupted by Gaussian noise with identity covariance matrix:

$\displaystyle{ X = \theta + \eta }$

How should we estimate $\theta$ from $X$ ?

A natural idea consists in using $X$ itself. This is the maximum-likelihood estimator of $\theta$ and is indeed a good estimator of $\theta$. It is the best estimator that is translation-invariant; it is minimax; etc.

However, Stein has shown that $X$ is not actually perfect: there exists a whole family of estimators which are better than it, if $d \geq 3$. These are estimators of the form:

$\displaystyle{ \hat{\theta} = \left(1 - \frac{d-2}{\|X -\theta_0 \|^2} \right) (X-\theta_0) +\theta_0}$

No matter the true value of $\theta$, these estimators always have lower Mean Squared Error than $X$. In other words, they always do a better job! In a sense, it is thus slightly “stupid” to use $X$ instead of them, because you are going to make bigger errors by doing so.

The reason this occurs is the strong power of biasing your estimators in high dimensions. Biasing increases the (squared) bias term of the error, but causes a stronger reduction in the variance, and is thus beneficial. The Stein estimator creates a bias towards $\theta_0$ which improves the performance of the estimation. While other biased estimators, such as those obtained by adding an $L_1$ or $L_2$ penalty (in machine learning’s horrible jargon, these are known as Lasso and Ridge penalties), would also improve over $X$ when the truth is close enough to $\theta_0$, $X$ is better than them when we are far away. The Stein estimator, however, is always better, since its bias vanishes when $X$ is far from $\theta_0$.
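The domination claim is easy to check by simulation (a toy setup of my own, shrinking towards $\theta_0 = 0$): over repeated draws of the noise, the shrinkage estimator has a visibly lower Mean Squared Error than $X$.

```python
import numpy as np

# Monte Carlo comparison of X and the Stein estimator (theta_0 = 0).
rng = np.random.default_rng(6)
d, trials = 10, 20_000
theta = np.full(d, 1.0)  # an arbitrary true parameter, with d >= 3

X = theta + rng.normal(size=(trials, d))  # X = theta + standard Gaussian noise
norm2 = np.sum(X**2, axis=1, keepdims=True)
theta_js = (1.0 - (d - 2) / norm2) * X    # shrink X towards theta_0 = 0

mse_mle = np.mean(np.sum((X - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((theta_js - theta) ** 2, axis=1))
print(mse_mle, mse_js)  # mse_js comes out clearly smaller
```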

When I first saw this result, I was very perplexed by the very peculiar form of the Stein estimator. There is a tendency in math to present properties like this one as magical: “here is this one guy that just happens to have a crazy property”, instead of detailing where the property comes from. The Stein estimator is usually presented in this manner, but it does have an interesting origin. Indeed, it is actually close to a Bayesian estimator (more precisely, it can be derived from an empirical Bayes approach; since it isn’t truly Bayesian, this suggests that there exists another estimator that dominates it). This makes a little bit of sense, since Bayesian estimators are known to have good Mean Squared Error. Of course, I’m still very curious whether there could be other biases which result in estimators that dominate $X$ everywhere. I’m just never satisfied with a single counter-example: I want to know the set of all counter-examples.

Thus, the Stein phenomenon gives us the high-level lesson that “Bias is good (in high-dimensions)”. However, there remains one question: in which direction should we bias our estimate? This is an interesting question.

Indeed, if we have no idea which $\theta_0$ to use, one idea is to choose it randomly around $X$, which constitutes a natural first guess for $\theta$. However, that is stupid, by the following argument:

• Biasing our estimator towards $\theta_0$ is only good compared to $X$ if $\theta_0$ is pulling us towards the true value $\theta$.
• If we choose randomly around $X$, half of the values that we choose are going to be further away from $\theta$ than $X$ is.
• Thus, randomly choosing our bias direction is a bad idea: it’s going to be worse than $X$ (see https://arxiv.org/pdf/1203.5626.pdf for a more detailed presentation of that idea).

Thus, biasing is indeed good if we have true prior information about the direction in which we should bias our guess. If we don’t, then it is better to use the unbiased estimator $X$ since a randomly chosen bias is unlikely to actually achieve a reduction in error.