Regression IV – Does it really work (mathematically speaking)?

So, in an earlier post, we saw how to derive the formula for the coefficients of regression. We did this by defining the error function

\[ (X\beta - Y)^T(X\beta - Y) \]

differentiating with respect to \(\beta\) and setting to zero to get

\[ \beta^T X^T X - Y^T X = 0 \]

and then solving for \(\beta\).

Although our method was correct, we were cheating a little. Because the function is differentiable and we set its derivative to zero, we have certainly found a stationary point, but we haven't shown that this is a minimum as opposed to a saddle point or a maximum.

To do this, let's have a look at the second derivative of our function (the Hessian). This is the constant matrix

\[ X^TX \]

So, at any point on our surface and for any vector \(v\), the second directional derivative will be

\[ v^T X^T X v \]

We can rewrite this as,

\[ (X v)^T (X v) \]

Which is just the dot product of the vector \(Xv\) with itself, so it is always greater than or equal to zero. More concisely, the Hessian is positive semidefinite everywhere.

This means that at every point on the surface, the second directional derivative is non-negative in every direction, so along any line the slope never decreases. In particular, when we move away from a stationary point in any direction, the value of our function cannot fall below its value there, so our stationary point is in fact a minimum. What is more, it is a global minimum! This property, a positive semidefinite Hessian at every point, is called convexity.
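We can check this numerically. The sketch below (random made-up data, arbitrary sizes and seed) verifies that \(v^T X^T X v\) is non-negative for many random directions \(v\), and equivalently that every eigenvalue of the Hessian is non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # 50 samples, 3 predictors (made-up data)
H = X.T @ X                    # the constant Hessian discussed above

# The second directional derivative v^T X^T X v is non-negative in every direction
for _ in range(1000):
    v = rng.normal(size=3)
    assert v @ H @ v >= 0

# Equivalently, the Hessian has no negative eigenvalues
print(np.linalg.eigvalsh(H))
```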

Now, how do we know that this global minimum is unique? Well, when we solved for \(\beta\), we derived a formula for it,

\[ (X^TX)^{-1}X^TY \]

However, when we did it we sneakily assumed that the matrix \(X^TX\) was invertible. If it is not invertible, there will not be a unique minimum. We know from linear algebra that this matrix is invertible exactly when \(X\) has full column rank. However, the columns of \(X\) are our predictors, so we have a unique minimum exactly when our predictors are linearly independent.

If we think about this it makes a lot of sense. Suppose one of our predictors is a multiple of another, say \(X_1 = a X_2\). Then for any stationary point \(\beta_1,\beta_2,...\) of our error function, we can trivially create infinitely many new stationary points by replacing \(\beta_1\) with \(\beta_1 + t\) and \(\beta_2\) with \(\beta_2 - a t\) for any \(t\): the change \(t a X_2\) contributed by the first term is exactly cancelled by the second. In other words, we have an extra degree of freedom.
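We can see this flat direction numerically. In this sketch (made-up data, with \(a = 2\)), shifting the coefficients as described leaves the fitted values, and therefore the squared error, exactly unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
a = 2.0
X2 = rng.normal(size=100)
X1 = a * X2                      # X1 is an exact multiple of X2
X = np.column_stack([X1, X2])
Y = rng.normal(size=100)

beta = np.array([1.5, -0.7])
t = 10.0
beta_shifted = beta + np.array([t, -a * t])  # move along the flat direction

# Same fitted values, therefore the same squared error
assert np.allclose(X @ beta, X @ beta_shifted)
err = lambda b: np.sum((X @ b - Y) ** 2)
assert np.isclose(err(beta), err(beta_shifted))
```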

This can cause us problems if we are finding our minimum numerically. If one of our predictors is almost a linear combination of the others, the error surface has a long, nearly flat valley, and the optimisation that searches for the minimum will spend a lot of time swimming around in it before converging.
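We can quantify "almost a linear combination" with the condition number of \(X^TX\): the larger it is, the flatter and more stretched the error surface. A quick sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
X2 = rng.normal(size=100)
X1 = X2 + 1e-6 * rng.normal(size=100)  # almost an exact copy of X2
X = np.column_stack([X1, X2])

# A huge condition number signals a long, nearly flat valley in the error surface
print(np.linalg.cond(X.T @ X))
```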

Regression III – A Practical Example

We’ve already covered the basics and some of the theory of regression. So now, let’s work through a practical example of linear regression in Python.

Let’s mock up some practice data. We’ll create a simple predictor X that takes all the values from 0 up to (but not including) 20 in steps of 0.1, and a response Y which depends on X linearly via Y = 5X + 3 + (random noise). We generate this in Python like so

import numpy as np

X = np.arange(0, 20, 0.1)                  # 200 values: 0.0, 0.1, ..., 19.9
Y = 5 * X + 3 + np.random.rand(200) / 10   # linear signal plus small uniform noise

Now we can create a linear model and fit it with scikit-learn

from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X.reshape(-1,1),Y)

We need to reshape the X array because the fit method expects a two dimensional array, with the first dimension indexing the samples and the second indexing the predictors.

We can see the coefficients of this model with

model.coef_, model.intercept_
Which should give something like

(array([5.00041147]), 3.0461878369249646)

Then, if we would like to forecast values, say for the X value 11.34, we can use the predict method on our model

model.predict(np.array([[11.34]]))

Regression II – A Little Theory

Now that we have introduced linear regression, we are going to think a little bit about the theory behind it. Suppose we have \(n\) predictors, which we call \(X_1,…,X_n\), and a response, which we call \(Y\), and we would like to do some regression. To do this, we need some values for our predictors and response; let’s say we have \(m\) of these. We must remember to distinguish between the variables \(X_1,…,X_n,Y\) and their values, which we usually call samples. The former we generally identify with the axes of an \((n+1)\)-dimensional space, and the latter with points in that space.

So, we would like to find a hyperplane that comes as close as possible to each of our \(m\) sample points. More specifically, if we consider our hyperplane as a function of \(X_1,…,X_n\), then for each sample \(i\), we would like to minimise the distance between the point on our hyperplane at \((X_{1,i},…,X_{n,i})\) and the sample point \((X_{1,i},…,X_{n,i},Y_i)\).

This is all a bit hard to visualise, so let’s think about things algebraically. We can think of our hyperplane as the set of points defined by

\[ \beta_0 + \beta_1 X_1 + … + \beta_n X_n \]

for some scalars \(\beta_0,…,\beta_n\). And for each sample \(i\), we let

\[ \beta_0 + \beta_1 X_{1,i} + … + \beta_n X_{n,i} = \hat{Y_i} \]

Now, we need to pick a good algebraic notion of distance. Our best bet is the Euclidean distance: it lines up well with our intuitive notion of geometric distance, and it has plenty of nice mathematical properties.

So for each \(i\), we want to minimise:

\[ \sqrt{(X_{1,i} - X_{1,i})^2 + … + (X_{n,i} - X_{n,i})^2 + (Y_i - \hat{Y_i})^2 } = \sqrt{(Y_i - \hat{Y_i})^2} \]

The square root is an increasing function, so although it changes the value of a function's minimum, it does not change its location, and we can leave it out. Also, we would like to minimise this value for all points simultaneously; as the terms are all non-negative, a natural way to do this is to minimise their sum, so we take the sum over all \(m\) samples, which gives us

\[ \sum_{i = 1} ^m { (Y_i - \hat{Y_i})^2 } \]

And we can rewrite this using our definition of \(\hat{Y}\) as

\[\sum_{i=1}^m{ \left( Y_i - \left(\beta_0 + \sum_{j=1}^n\beta_j X_{j,i}\right)\right)^2}\]

So, we have a formula that we would like to minimise, and the good news is that there is an analytic solution to this! We just follow the normal process for analytically finding the location of a minimum, we differentiate our objective function with respect to the \(\beta\) and then we set the result to zero and solve for the \(\beta\).

Rather than deal with various different indices, we are going to convert our objective function into matrix form. To do this, we borrow a standard trick: we add a new predictor \(X_0\), which has constant value \(1\), to the set of predictors \(X_1,…,X_n\). This gets rid of the floating \(\beta_0\) term, so we can now write

\[ \hat{Y_i} = \beta_0 + \sum_{j = 1}^n{\beta_j X_{j,i} } = \sum_{j = 0}^n{\beta_j X_{j,i} } \]

Then we can rewrite this as a matrix multiplication

\[ \hat{Y} = X \beta \]

Here, the columns of \(X\) represent predictors and the rows represent samples. The value we would like to minimise can then be rewritten in matrix form as

\[ ( X \beta - Y)^T ( X \beta - Y)\]
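As a quick numerical check that the matrix form agrees with the sum over samples, here is a sketch with made-up random data, with the constant predictor \(X_0 = 1\) included as the first column:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 30, 2
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # first column is X_0 = 1
Y = rng.normal(size=m)
beta = rng.normal(size=n + 1)

sum_form = sum((Y[i] - X[i] @ beta) ** 2 for i in range(m))  # sum over samples
matrix_form = (X @ beta - Y) @ (X @ beta - Y)                # matrix form
assert np.isclose(sum_form, matrix_form)                     # same objective, two notations
```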

Before we differentiate, we are going to rearrange a little. So, we multiply this out and get

\[ (X \beta)^T X \beta - Y^T X \beta - ( X \beta)^T Y + Y^T Y \]

We can simplify this. Firstly, note that the middle two terms are scalars, so we can transpose them freely without changing their value, giving

\[ Y^T X \beta = ( X \beta)^T Y \]

Also, we can distribute the transpose in the first term. Putting this together, we get

\[ \beta^T X^T X \beta - 2 Y^T X \beta + Y^T Y \]

Now, we follow the normal rules for differentiation of vectors and matrices. We can think of the first term as

\[ \beta^T A \beta \text{ where } A = X^TX\]

The derivative of which is

\[ \beta^T (A^T + A) = \beta^T ((X^T X)^T + (X^T X)) = \beta^T ((X^T X) + (X^T X)) = 2 \beta^T X^T X \]

Have a look at the standard rules for matrix differentiation for the first step. Then, when we differentiate the second term of our objective function, we are just differentiating a row vector times a column vector, so we get

\[-2 Y^T X\]

Finally, the third term is constant with respect to \(\beta\), so its derivative is zero. Putting this all together we get

\[ \beta^T (X^T X ) - Y^T X = 0 \]

Rearranging and taking the transpose gives

\[ \beta = (X^T X)^{-1} X^{T} Y \]

using the fact that the transpose of an inverse is the inverse of a transpose.

So, the coefficients of our regression are determined exactly by this equation! All we have to do is plug our sample values into the \(X\) and \(Y\) matrices, and we have the parameters of the best fit hyperplane.
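To see the formula in action, this sketch generates made-up data from a known line, builds \(X\) with the constant predictor, and recovers the coefficients via the formula above, cross-checked against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 100
x = rng.uniform(0, 10, size=m)
X = np.column_stack([np.ones(m), x])          # constant predictor X_0 = 1, then x
Y = 3 + 5 * x + rng.normal(scale=0.5, size=m)

beta = np.linalg.inv(X.T @ X) @ X.T @ Y       # the closed-form solution above
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

assert np.allclose(beta, beta_lstsq)          # agrees with NumPy's solver
print(beta)                                   # roughly [3, 5]
```

In practice, `np.linalg.lstsq` (or `np.linalg.solve` on the normal equations) is preferred to forming the inverse explicitly, as it is more numerically stable.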

In real life, when we have a lot of predictors or a lot of samples, doing this matrix algebra can be quite expensive, and it may in fact be easier to solve the optimisation problem numerically. The good news is that our objective function is convex, so any minimum a numerical search finds is a global minimum.
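A numerical solver only needs the gradient we derived, \(2\beta^T X^T X - 2Y^TX\). We can sanity-check that formula against a central finite-difference approximation on random made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 30, 3
X = rng.normal(size=(m, n))
Y = rng.normal(size=m)
beta = rng.normal(size=n)

def f(b):
    r = X @ b - Y
    return r @ r                              # the objective (Xb - Y)^T (Xb - Y)

analytic = 2 * beta @ (X.T @ X) - 2 * Y @ X   # the derivative derived above

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([(f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(analytic, numeric, atol=1e-4)
```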

Regression I – What is Regression?

Suppose we have a long list of pairs of real numbers. We can imagine these pairs of numbers as points in a plane. Then we can try and draw a line that is as close as possible to all the points at once. This is regression. It can be any sort of line that we draw, but a straight line is often best. This is linear regression. For now, we are only going to consider linear regression.

Usually we think of one of the values as causing the changes in the other. The value that is causing this change is called the predictor, and the other value is called the response.

Why would we want to do this? Well, first we might want to investigate whether there is a relationship between our two variables, and if there is, to measure how strong it is. We can answer these questions by looking at how easy it is to draw a line that fits our points, and at the slope of that line. The first gives us an indication of how well our model fits the data, and the second tells us the strength of the relationship. However, it is possible to have a line that fits very well but whose slope is approximately zero; this tells us that even though we can fit the data, we haven't found any evidence of a relationship.

The second reason we might do regression is because we want to forecast values of our response. Suppose that we have fit our data well and we have convinced ourselves that we are modelling some underlying process. Say our matched pairs of data are the price of oil at time t – 1, and the stock price of an airline at time t. We can perform a regression on the data, then, whenever we have a new data point for the price of oil, we can forecast a new price for the stock.

Before we drew our line, we didn’t make any particular assumptions. We didn’t say anything about things being independent, or drawn from the same distribution, or normal, or in fact anything about distributions at all. There is a lot of theory that explains regression in mathematical terms, and talks about various assumptions required, and facts you can assume. But it only really exists to provide intellectual justification for what we have talked about.

Something important to remember is that just because we are doing a regression and finding a nice line that fits our data does not mean that there really is a relationship between the values. In particular, if we do a linear regression it does not necessarily mean that there is a linear relationship between them. In the other direction, if we fail to find a relationship via regression, that does not mean one does not exist; in particular, if we don't find anything with linear regression, it is possible that there is a non-linear relationship that we are not capturing. With all that scepticism in mind, we can say that linear regression is really useful, because a lot of processes are linear, or at least well approximated linearly, at least locally.

Something that I haven't mentioned is that we are not limited to a single predictor for each response; we can have multiple predictors. Geometrically speaking, it gets a bit more complicated: for example, if we have two predictors, we are now plotting points in three dimensional space, and instead of drawing a line we are drawing a plane.