Today, I want to discuss something that seems extremely small but is critical in “supervised” problems, in which you are trying to predict some data $y$ from some other data $x$. In a nutshell, you should always make sure that your predictors are centered (their center is 0) and scaled (their width is 1). Let’s dive into the details!
First, let me present the general “supervised learning” setting. We are given some number $n$ of pairs of examples $(x_i, y_i)$. What we want to accomplish is to learn a function $f$ such that $f(x_i) \approx y_i$. In general, we focus on trying to learn a linear function of $x$, but more general forms for the function are also possible. The usual approach is to write down a probabilistic model of the $y_i$ conditional on the $x_i$ and on the parameters $\theta$ of $f$, and to maximize the likelihood to find the best values for $\theta$. If we want to be fancy, we can also add a regularizing penalty on $\theta$.
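To make the setting concrete, the regularized maximum-likelihood problem can be written roughly as follows. The symbols here are assumptions chosen for illustration: $\theta = (w, b)$ for the parameters of a linear $f$, $R$ for an unspecified penalty, and $\lambda \ge 0$ for its strength.

```latex
f_{\theta}(x) = w^{\top} x + b, \qquad \theta = (w, b)

\hat{\theta} = \arg\max_{\theta} \; \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \; - \; \lambda \, R(\theta)
```

Setting $\lambda = 0$ recovers plain maximum likelihood.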
This is all straightforward statistics. However, there is one critical step that many sources forget to mention. Quite simply, we need to do a little bit of processing on all the components of our predictor $x$. This processing needs to ensure that the various components are all (approximately) centered and scaled.
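As a minimal sketch of that processing step, here is one way to center and scale every component with NumPy. The function name `standardize`, the choice of mean and standard deviation, and the toy data are all assumptions for illustration.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    scale = X.std(axis=0)
    # Leave (near-)constant columns unscaled to avoid dividing by zero.
    scale = np.where(scale < eps, 1.0, scale)
    return (X - center) / scale, center, scale

# Toy predictors living on very different scales (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(loc=[170.0, 0.002], scale=[10.0, 0.0005], size=(200, 2))

Z, center, scale = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

The returned `center` and `scale` are worth keeping: new data should be transformed with the values computed on the training set, as `(X_new - center) / scale`.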
This can be justified in a variety of ways, but the one that makes the most sense to me is that we should try to make our methods as invariant to details of the input as they can be, unless we have a very good reason not to. In this case, it is trivial to imagine situations in which the predictors are shifted or rescaled for some reason. For example, if we choose different units for a measurement, that would change the scale of the predictors. It’s rarer for the center of the distribution to change, but that can sometimes happen. None of these modifications should change the result of our inference. Thus, our methods should have a step to remove these extra degrees of freedom and ensure that our inference is invariant.
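The invariance argument can be checked on a small example. The sketch below assumes centering with the mean and scaling with the standard deviation, and uses made-up measurements: heights in centimeters versus inches (a change of scale) and temperatures in Celsius versus Fahrenheit (a change of scale and of center).

```python
import numpy as np

def zscore(x):
    """Center with the mean and scale with the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
height_in = height_cm / 2.54            # same measurement, different unit

temp_c = np.array([-5.0, 0.0, 10.0, 20.0, 30.0])
temp_f = temp_c * 9.0 / 5.0 + 32.0      # unit change that also shifts the center

same_height = np.allclose(zscore(height_cm), zscore(height_in))
same_temp = np.allclose(zscore(temp_c), zscore(temp_f))
print(same_height, same_temp)
```

After the preprocessing step, both versions of each predictor are numerically the same, so anything computed downstream cannot depend on the choice of units.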
Furthermore, consider that what we are trying to do is gain information about $y$ from the $x$. When a value $x_i$ is close to the center of the $x$ values, that is a normal value of $x$ that should provide us with no information. Thus, it shouldn’t change our evaluation of $y$. This intuition can only hold if the center of $x$ is 0: in a linear model, a predictor contributes nothing to the prediction exactly when its value is 0. Similarly, in order to know how relevant it is that $x$ differs from its center, we need to know the scale at which $x$ varies. If the value we are considering is close to 0 at the relevant scale for $x$, then again this should have a low impact on the value we predict for $y$. Centering and scaling the predictors thus ensures that we treat the information we gain from all of them equally.
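Here is a small illustration of this intuition, assuming a linear model with an intercept fitted by ordinary least squares (the data and the names `b`, `w`, `z` are made up). Once the predictor is centered, the fitted intercept is just the average of $y$, so a "normal" value of the predictor (0 after centering) leaves the prediction at that baseline.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50.0, scale=5.0, size=500)            # raw predictor
y = 3.0 + 0.8 * x + rng.normal(scale=1.0, size=500)      # noisy linear response

z = (x - x.mean()) / x.std()                             # centered and scaled

# Least-squares fit of y against an intercept b and a slope w on z.
A = np.column_stack([np.ones_like(z), z])
(b, w), *_ = np.linalg.lstsq(A, y, rcond=None)

print("intercept:", b, "mean of y:", y.mean())
print("prediction at z = 0:", b + w * 0.0)
```

The slope `w` is then expressed per unit of the predictor's own scale, which is what makes slopes comparable across predictors.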
Now comes the thorny question: how exactly should we center and scale the predictors? Indeed, there are infinitely many notions of the center and scale of a random variable. Should we center with the (empirical) mean of the $x_i$? Or should we prefer the median? Should we scale using the square root of the variance? Or the deviation? Or the inter-quartile spacing? I do not know the appropriate answer to these questions (and honestly, I’m not even sure there is a single appropriate answer). My instinct is to use a robust measure of the width (key instinct: always be robust), so the deviation sounds like a fine choice to me, but the variance is probably fine too.
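To get a feel for how these candidates differ, the sketch below computes several of them on made-up data, with and without a handful of gross outliers. Since the text does not pin down which deviation is meant, both the mean absolute deviation and the median absolute deviation are included as guesses; the helper name `center_scale_summaries` is hypothetical.

```python
import numpy as np

def center_scale_summaries(x):
    """Return several candidate notions of center and width for a 1-D sample."""
    x = np.asarray(x, dtype=float)
    mean, median = x.mean(), np.median(x)
    q25, q75 = np.percentile(x, [25, 75])
    return {
        "mean": mean,
        "median": median,
        "std": x.std(),
        "mean_abs_dev": np.mean(np.abs(x - mean)),
        "median_abs_dev": np.median(np.abs(x - median)),
        "iqr": q75 - q25,
    }

rng = np.random.default_rng(2)
clean = rng.normal(size=1000)
dirty = clean.copy()
dirty[:10] = 1000.0   # ten gross outliers, 1% of the sample

for name, data in [("clean", clean), ("dirty", dirty)]:
    summary = center_scale_summaries(data)
    print(name, {k: round(float(v), 3) for k, v in summary.items()})
```

On this toy sample the mean and standard deviation move a lot when the outliers are added, while the median and inter-quartile spacing barely change, which is the sense in which the latter are robust.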
As a final note, let me talk about one very cool method that people have been using in deep learning to make gradient descent work better. This method is called batch normalization. During batch training, we do not compute the output of the deep network as usual. Instead, we compute the activity of each layer one by one, and before feeding a layer’s output to the next layer, we make sure that the output of each unit in the layer is centered and scaled (using the empirical mean and variance inside the training batch under consideration). This little trick really improves the speed at which the network learns (somebody may have a good intuition as to why, but I don’t).
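A rough NumPy sketch of the normalization step described here, in training mode only. The tiny two-layer network, its weights, and the function name `batch_norm_forward` are made up for illustration; full batch normalization as usually implemented also has a learnable scale and shift per unit and uses running statistics at inference time, which this sketch leaves out.

```python
import numpy as np

def batch_norm_forward(h, eps=1e-5):
    """Center and scale each unit (column) of h using the current batch's statistics."""
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    # eps keeps the division stable when a unit has almost no variance.
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(3)
batch = rng.normal(size=(64, 10))            # 64 examples, 10 input features
W1 = rng.normal(scale=3.0, size=(10, 32))    # hypothetical first-layer weights
W2 = rng.normal(size=(32, 1))                # hypothetical output weights

h = np.maximum(batch @ W1, 0.0)   # activity of the hidden layer (ReLU units)
h_norm = batch_norm_forward(h)    # each unit now has mean ~0, variance ~1 in this batch
out = h_norm @ W2                 # the next layer sees the normalized activity

print(h_norm.mean(axis=0)[:3], h_norm.std(axis=0)[:3])
```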
As a conclusion, please always remember to center and scale your predictors when doing supervised learning!