In applied math, systems of equations rarely or never have a definite solution. Why? Because data points scatter themselves all over the place, and any model that attempts to fit the data is doomed to be imperfect ("doomed" is ironic here). In other words, the model is the best solution, in the absence of a perfect solution. The model will be a line, a plane, a hyperplane, a support vector... any number of lines, squiggles, or hyper-objects in multi-dimensional space, which come as closely as possible to each and every data point, without necessarily touching any of them.
In matrix-speak, the systems of equations, which might best fit the datasets, which everyday life yields, are rectangular. Think Excel spreadsheet. In a typical Excel spreadsheet, there are only a few columns, and potentially hundreds of thousands of rows. Geometrically speaking, the array structure of spreadsheet data forms a really tall, skinny rectangle. Converting said data into a system of independent variables plus a target variable, as one invariably does when modelling, renders it a rectangular matrix. If every matrix with a definite solution is square, then we know ahead of time that real life datasets are under-determined, i.e. rectangular.
Generating a "best fit" model is not as easy as it sounds. Geometrically speaking, drawing a model is easy. Most people can eyeball a best-fit line or curve. However, generating the equation of that model is another story. A model is supposed to come as closely as possible to every observed data point, which means that the measure of fitness is the distance between the model and the data. Anyone who has ever touched a statistical regression problem must have a good feel for the daunting chasm between theory and practice. In regression it becomes obvious that achieving the model equation means solving as many equations as there are data points, simultaneously.
Least Squares is the most used method in statistics to perform linear regression. Enter State Right: The Loss Function.
Below you will find the formula for Mean Squared Error, which is the function that most often gets "minimized" in least squares. The difference in Y terms, Y vs Ŷ, just is the difference between the model output and the observed data. Luckily, whenever minimization is desired, the math gods have given us a well stocked toolbox, with which to force the formula, as it were, and make it approach zero. The two main choices from Calculus, and more recently, Linear Algebra.
Data comes in many patterns, and may not be adequately modeled with linear functions. For nonlinear cost functions, there are other methods for minimization, such as gradient descent. In this post, we will deal only with linear data.
The Calculus Way
In Calculus, the first derivative is the instantaneous rate of change of a function. Wherever rate of change goes to zero, the original function, (of which the derivative is derived), reaches an inflection point: A hill or a trough, a maximum or a minimum. The MSE function produces only positive values for the coefficient of the squared term, which means it's convex. Thus, wherever the first derivative goes to zero, you can be sure you will find the coefficients for your model, because those will be the weights, which when applied to any x, will produce the smallest possible y. By this method, we back into our model via the solution set. Hence "regress."
"Just take the first derivative of the loss function and set it to zero, and voila! You have your model!" Sounds great, but computationally, the Calculus version is hell, especially with multiple coefficients (for multiple independent variables). In the robust parametrized models of modern data science, there are literally thousands of independent variables. Each and every variable requires taking a partial derivative with respect to its respective coefficient, then setting the resulting function equal to zero. That is a lot of hairy math!
The math is so hairy, in fact, that it's tough to find any instances where an instructor actually slogs through the computations. Usually instructors compute partial derivatives, and end with hand waving and assurances that software that will do the computations for you... And if you care to go through the computations below, you will find out why. The funniest thing about the Calculus formula is that it resolves into a matrix at the last step. It resolves into the matrix, which constitutes Step 1 of the Linear Algebra methods. That's right! You can't even do Calculus without Linear Algebra. So why leave the reservation in the first place? Obviously, I am kidding. But every joke contains some kernel of truth.
Jump to "The Linear Algebra Way" to skip over the Calculus computations.
Let's go through the steps of using Calculus to compute a fairly simple linear regression model (with two parameters):
I. Define the Multiple Linear Regression Model:
Ŷ = β_{0}+β_{1}X_{1}+β_{2}X_{2}
where:
- Ŷ is the model (predicted values).
- X_{1} and X_{2} are the independent variables.
- β_{0} is the y-intercept.
- β_{1} is the coefficient for X_{1}.
- β_{2} is the coefficient for X_{2}.
II. Minimize the MSE with Respect to β_{0}, β_{1}, and β_{2}:
Now that we have defined the model function, we can plug it into the loss function (in this case the MSE), and "optimize" it. To minimize the MSE, you need to find the values of β_{0}, β_{1}, and β_{2}, which make it equal to zero at the respective first derivative. To do this, you set the partial derivatives of the MSE with respect to each coefficient (β_{0}, β_{1}, and β_{2}) equal to zero. (Note that every ∑ summation term below is a sum to n with interval 1. In some cases, I have simplified the notation.)
1. Partial Derivative with Respect to β_{0}:
∂/∂β_{0} (1/n ∑_{i=1} (Y_{i}− (β_{0}+β_{1}X_{1i}+β_{2}X_{2i}))^{2}) = 0
2. Partial Derivative with Respect to β_{1}:
∂/∂β_{1} (1/n ∑_{i=1} (Y_{i}− (β_{0}+β_{1}X_{1i}+β_{2}X_{2i}))^{2}) = 0
3. Partial Derivative with Respect to β_{2}:
∂/∂β_{2} (1/n ∑_{i=1} (Y_{i}− (β_{0}+β_{1}X_{1i}+β_{2}X_{2i}))^{2}) = 0
Maybe this doesn't look so bad to you, but we have only just written out formulas. In order to solve them, we need to do a lot of algebra. In what follows, I will go through the actual steps of solving... Once we take all of these partial derivatives, you will see that the resulting system of equations, sometimes called "the normal equations," resolves to be the same linear system as A^{T}Ax̂=A^{T}b. Yes, the MSE is quadratic, but its derivatives are linear with respect to the weights.
Note that we don't really need the 1/n part of the MSE equation. For the purposes of our equation, 1/n is a constant, because n is a known variable. It is not one of the unknowns, which our equation aims to solve. Any constant can be factored out of all terms of a summation–and in the context of setting the equation equal to zero, can be cancelled out if desired. It's good to know that it's there, and of course, it in some ways is a very important number. "Sum of Squares," the non-normalized cousin of MSE, is basically the same thing as MSE, except not averaged, i.e. not divided by n.
Alright, let's keep plugging away:
IIIa. Take a partial derivative with respect to β_{0} (Note: All ∑ notation is ∑_{i=1} summed to n):
OK, so we start with the MSE equation, which again, is our loss function:
MSE = (1/n ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i}))^{2})
Let's compute the derivative and set it equal to zero. Here you may be tempted to expand the squared term (Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i}))^{2}, but don't do it! It's much easier to use the chain rule. Bring down the exponent down into a leading constant, and derive the inner function. Thus the derivative simplifies as follows:
∂/∂β_{0} (2/n) ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i})) * (-1) = 0
Next, you can get rid of 2/n by multiplying both sides by its inverse. Then multiply through with the -1 derivative term:
∑(Y_{i} - β_{0} - β_{1}X_{1i} - β_{2}X_{2i}))* (-1) = 0
First, we distribute the minus sign inside the parenthesis above. Make sure you get your signs right! We don't want to make a negative/positive signs error. (I made a sign error while preparing for this post, even though I tried hard not to):
∑(-Y_{i} + β_{0} + β_{1}X_{1i} + β_{2}X_{2i}) = 0
Now, we want to isolate all components of the equation, which are subject to summation. These isolated components will result in modular building blocks, so to speak, which are easier to work with. In other words, we move the ∑Y_{i} term out, followed by the other terms.
-∑Y_{i} + ∑β_{0} + β_{1}∑X_{1i} + β_{2}∑X_{2i}) = 0
The y-intercept β_{0} ultimately is a constant. Therefore, the sum of all β_{0} is just n*β_{0}. Similarly, the coefficients β_{1} and β_{2} will be constants, which means they can be moved in front of their summations.
-∑Y_{i} + nβ_{0} + β_{1}∑X_{1i} + β_{2}∑X_{2i} = 0
The final step of this calculation, once we draw the partial derivatives for the other two variables, will be to organize all three equations in a linear system. In order to do this, we will want all of the equations in the form of β_{0}+β_{1}+β_{2}=Y_{i} as shown:
∑Y_{i} = nβ_{0} + β_{1}∑X_{1i} + β_{2}∑X_{2i}
Let's keep this equation on the back burner, while we work on the other two coefficients' partial derivatives. Also, just to note, we can bring back the 1/n if we want. Not needed at the moment.
IIIb. To take a partial derivative with respect to β1, you follow these steps (all ∑ notation is ∑_{i=1} summed to n):
The MSE equation is (the same as the equation we started with above):
∂/∂β_{1} (1/n ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i}))^{2}) = 0
Let's compute the partial derivative and set it equal to zero, like we did before. Again, using the chain rule, the derivative simplifies as follows:
∂/∂β_{1} = (2/n) ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i})) * (-X_{1i}) = 0
Once again, we can get rid of the 2/n because it's just a constant, and we can multiply both sides by its inverse. We can always bring it back later if we want:
∑(Y_{i} - β_{0} - β_{1}X_{1i} - β_{2}X_{2i})) * (-X_{1i}) = 0
Distribute the -X_{1i} term from the outer parenthesis, multiplying each term inside the parenthesis by the X_{1i} term.
∑(-Y_{i}X_{1i} + β_{0}X_{1i} + β_{1}X_{1i}^{2} + β_{2}X_{1i}X_{2i}) = 0
Distribute the summation terms. Then, factor out the constant terms for each summation component.
-∑Y_{i}X_{1i} + ∑β_{0}X_{1i} + ∑β_{1}X_{1i}^{2} + ∑β_{2}X_{1i}X_{2i} = 0
As before, we want to organize the terms so that we can easily put them into a matrix of linear equations at the last step. It may help to realize that the X and Y terms are not variables, which we need to solve for. Values for those terms will be given by the data points we have. We are solving for the coefficient terms β_{0}, β_{1}, and β_{2}.
∑Y_{i}X_{1i} = β_{0}∑X_{1i} + β_{1}∑X_{1i}^{2} + β_{2}∑X_{1i}X_{2i}
IIIc. To take a partial derivative and solve the first equation for β_{2}, you follow these steps (again all ∑ notation is ∑_{i=1} summed to n):
OK the MSE equation again: ∂/∂β_{2} (1/n ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i}))^{2}) = 0
Let's compute the derivative and set it equal to 0:
∂/∂β_{2} (1/n ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i}))^{2}) = 0
Using the chain rule, the derivative simplifies as follows:
(2/n) ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i})) * (-X_{2i}) = 0
Now, you can move 2/n out of the equation by multiplying both sides by n/2. Then distribute the X term:
(2/n) ∑(Y_{i} - (β_{0} + β_{1}X_{1i} + β_{2}X_{2i})) * -X_{2i} = 0
∑(-Y_{i}X_{2i} + β_{0}X_{2i} + β_{1}X_{1i}X_{2i} + β_{2}X_{2i}^{2}) = 0
As before, distribute the summation. Remember that all β terms are constants, so they can be moved in front of the sigma:
-∑Y_{i}X_{2i} + β_{0}∑X_{2i} + β_{1}∑X_{1i}X_{2i} + β_{2}∑X_{2i}^{2} = 0
Now, isolate the term containing Y_{i} on one side:
∑Y_{i}X_{2i} = β_{0}∑X_{2i} + β_{1}∑X_{1i}X_{2i} + β_{2}∑X_{2i}^{2}
IV. Now we can arrange our three equations into a system of linear equations, and given some data points, we have everything we need to solve for our coefficients.
∑Y = nβ_{0} + β_{1}∑X_{1i} + β_{2}∑X_{2i}
∑YX_{1i} = β_{0}∑X_{1i} + β_{1}∑X_{1i}^{2} + β_{2}∑X_{1i}X_{2i}
∑YX_{2i} = β_{0}∑X_{2i} + β_{1}∑X_{1i}X_{2i} + β_{2}∑X_{2i}^{2}
These equations are very cool. You will see why in a bit.
The final step in the calculus version is to take some sample data points, and plug them into the equations above. Then use row reduction to solve the system. Example Data:
- (x_{1₁}, x_{2₁}, y_{₁}) = (2, 1, 4)
- (x_{1₂}, x_{2₂}, y_{₂}) = (3, 2, 5)
- (x_{1₃}, x_{2₃}, y_{₃}) = (4, 6, 7)
- (x_{14}, x_{24}, y_{4}) = (5, 4, 5)
First, we need to compute the values for the right-hand sides of these equations using the provided data:
Let's calculate the sums for each variable:
- Number of data points (n) = 4
- Sum of the dependent variable (Y):
∑Y = y_{₁} + y_{₂} + y_{₃} + y_{₄} = 4 + 5 + 7 + 5 = 21 - Sum of the first independent variable (X_{₁}):
∑X_{₁} = x_{1₁} + x_{1₂} + x_{1₃} + x_{14} = 2 + 3 + 4 + 5 = 14 - Sum of the second independent variable (X_{₂}):
∑X_{₂} = x_{2₁} + x_{2₂} + x_{2₃} + x_{24} = 1 + 2 + 6 + 4 = 13 - Sum of the squares of the first independent variable (X_{₁}²):
∑X_{₁}² = x_{1₁}² + x_{1₂}² + x_{1₃}² + x_{14}² = 4 + 9 + 16 + 25 = 54 - Sum of the product of the first and second independent variables (X_{₁}X_{₂}): ∑X_{₁}X_{₂} = x_{1₁} * x_{2₁} + x_{1₂} * x_{2₂} + x_{1₃} * x_{2₃} + x_{14} * x_{24} =
(2 * 1) + (3 * 2) + (4 * 6) + (5 * 4) = 2 + 6 + 24 + 20 = 52 - Sum of the squares of the second independent variable (X_{₂}²):
∑X_{₂}² = x_{2₁}² + x_{2₂}² + x_{2₃}² + x_{24}² = 1 + 4 + 36 + 16 = 57 - Sum of the product of the dependent variable and the first independent variable (∑YX_{1i}): ∑YX_{1i} = (y_{₁} * x_{1₁}) + (y_{₂} * x_{1₂}) + (y_{₃} * x_{1₃}) + (y_{4} * x_{14}) =
(4 * 2) + (5 * 3) + (7 * 4) + (5 * 5) = 8 + 15 + 28 + 25 = 76 - Sum of the product of the dependent variable and the second independent variable (∑YX_{2i}): ∑YX_{2i} = (y_{₁} * x_{2₁}) + (y_{₂} * x_{2₂}) + (y_{₃} * x_{2₃}) + (y_{4} * x_{24}) =
(4 * 1) + (5 * 2) + (7 * 6) + (5 * 4) = 4 + 10 + 42 + 20 = 76
Now, we can substitute these values, as "modular components" into the normal equations:
nβ_{0} + β_{1}∑X_{1i} + β_{2}∑X_{2i} = ∑Y
β_{0}∑X_{1i} + β_{1}∑X_{1i}^{2} + β_{2}∑X_{1i}X_{2i} = ∑YX_{1i}
β_{0}∑X_{2i} + β_{1}∑X_{1i}X_{2i} + β_{2}∑X_{2i}^{2} = ∑YX_{2i}
- 4β_{0} + 14β_{1} + 13β_{2} = 21
- 14β_{0} + 54β_{1} + 52β_{2} = 76
- 13β_{0} + 52β_{1} + 57β_{2} = 76
Nice! Now all we need to do is put these numbers into a matrix, and row reduce, or LU decompose, and solve for β_{0}, β_{1}, and β_{2}.
The Linear Algebra Way
If you are like me, you dislike sigma notation, ∑. Granted, there is an element of personal bias in my memories of first encounters with sigma. Summations marked the beginning of the end for me, in high school. This had nothing to do with math. Rather, I couldn't hear myself think over the sound of the angels screaming, so to speak, and I remember sigmas being among the flotsam swirling in the dark and roiling waters, in which my soul nearly drowned. But now I am back, and I don't mind sigma notation. However, if I can avoid it, I would rather avoid it.
Probably my favorite thing about Linear Algebra, is that summation is built into matrix operations, which frees us from the need for the scary-looking sigma notation... Summation is built into the matrix of matrices, as it were. In other words, summation is a default matrix operation, so we don't need the sigma notation. Linear algebra formulas for least squares eliminate the annoying sigma notation, but perhaps even more joyously, they make added data dimensions really simple to compute. The same formula that works in one dimension works in three, twelve, and thirty thousand dimensions. I don't want to look at the Calculus version with 30,000 x-variables. Perish the thought!
Here it is, the linear algebra formula for Least Squares:
Isn't it beautiful? I honestly want to get a tattoo of this formula. All you need to solve it are two things, which in any data project, you are sure to have: (1) Your x-values, arranged in an array; and (2) your y-values, (the predicted values), arranged in a vector, appended to the array. That's it! The Fundamental Theorem of Linear Algebra gives us this beautifully elegant least squares formula, and it works for any system of linear equations (any matrix). If you are interested, I derive the formula in the next section of this post.
In the Calculus version, once we take all of the requisite partial derivatives, the resulting system of equations, sometimes called "the normal equations," resolves to be the same linear system as the system of matrices generated by the linear A^{T}Ax̂=A^{T}b. It's pretty amazing how that works.
So, let's pretend that we have the same data points from the Calculus example above:
- (x_{1₁}, x_{2₁}, y_{₁}) = (2, 1, 4)
- (x_{1₂}, x_{2₂}, y_{₂}) = (3, 2, 5)
- (x_{1₃}, x_{2₃}, y_{₃}) = (4, 6, 7)
- (x_{14}, x_{24}, y_{4}) = (5, 4, 5)
Now, this is crucial: The y-intercept is just a static number, so the value for its X_{0} is always 1. Thus Matrix A is defined:
Let's pause and notice something about Matrix A. First thing: Matrix A is rectangular. It's not exactly a spreadsheet with 10,000 rows, but it's still a tall rectangular matrix. If we had one data point fewer, all three points would lie on a plane, since a plane can be defined by three points (linearly independent, of course). What transforms Matrix A into a problem requiring a least squares solution is the fourth point, which throws us off of the would-be plane.
Now that we have Matrix A, we can plug it into the formula A^{T}Ax̂=A^{T}b. First we transpose A. Given Matrix A above, below is the equation we need to solve. From left to right, we have A transpose, multiplied by A, the coefficient vector x̂. On the right side of the equation, we see A transpose again, multiplying b, which is our given vector of "solutions," i.e. data points.
Now we are ready to rock and roll. First, we augment matrix A with vector b, and perform the matrix multiplication of A^{T}A and A^{T}b in one fell swoop, as it were. From our transformed solution vector, we are going to mine and extract the coefficient vector x̂: [β_{0},β_{1},β_{2}]. Below is the completed matrix multiplication, with A^{T}A and A^{T}b shown on the right.
Remember this guy, from the Calculus example above?
- 4β_{0} + 14β_{1} + 13β_{2} = 21
- 14β_{0} + 54β_{1} + 52β_{2} = 76
- 13β_{0} + 52β_{1} + 57β_{2} = 76
Look familiar? Maybe I can show it thus:
- 4β_{0} + 14β_{1} + 13β_{2} = 21
- 14β_{0} + 54β_{1} + 52β_{2} = 76
- 13β_{0} + 52β_{1} + 57β_{2} = 76
It's the normal equations! The ones we have yet to solve... The totals we generated using the sum of squares summations. The numbers look eerily similar to the augmented matrix A^{T}A|b. And there's a reason for that. Because, what we did in the Calculus example was compute the squared residuals, which is exactly what we do with the linear algebra formula. This is just ||ax-b||^{2}. It looks the same because it is the same. (Later on, I will derive the linear algebra formula from ||ax-b||^{2}).
The big win here is that it took us a lot of sloppy number crunching and summation module organization to get to this point in the Calculus. In Linear Algebra, you get to Step 50 in Step 2 (of course I'm exaggerating). From here, we can solve the equation just like we would solve any old Ax=b equation. Row reduce the augmented matrix, or decompose into LU, and find vector x.
I won't lie. There is still a lot of computation that goes into solving this system of equations. However, in the Calculus method, there are a whole bunch of steps, shown earlier in this post, to so much as arrive at a system of equations worth solving. The Linear Algebra formula is not only visually cleaner and easier to remember. It's also faster. It skips the whole aspect of computing partial derivatives.
Below are the computations, which I did by hand, to solve the system of equations. Initially I asked ChatGPT to double-check my work, but it failed miserably. So I used the Scipy function lstsq to double-check my work. I also did an LU decomposition with paper and pen just for sh*ts and giggles. Don't believe me? Here's the photographic evidence:
I prefer row reduction to LU decomposition, personally. Especially when I can do the row reduction methodically, and in a digital format, where I can copy and paste matrices, to easily move through steps. I like to record all of my transformations in a matrix E, on the lefthand side of the matrix being transformed. It helps me keep track of what I am doing, and what I have already done. When I stumble upon a computational error, it's easier to backtrack and figure out where I went wrong. As I move through further transformations, I change the previous transformations from black to grey, to indicate which transformations happened when.
Matrix E is the lower triangular matrix on the left side, shown below. You can see that solving this equation involved three row transformations, each involving two rows. The first transformation was -3*Row One added to Row Two. The second, -3*Row One added to Row Three. The final transformation is -2 1/2*Row Two added to Row Three. The resulting upper triangular matrix, augmented with A^{T}b, presents us with the new system of equations, which easily yields the solution vector, through back substitution (which I show in the following image).
The computations involved in row reduction are not difficult: Just multiplying and adding! But with each step being cumulative, any mistake made early on can be difficult to unwind. Whenever I row reduce matrices, I am reminded of rebuilding the fallen domino structures, which my son used to make, when he went through his domino phase. One error will cause a cascade of errors, and may require the entire computation to be redone.
So what makes the linear algebra formula so fantastic, if it's still a treacherous slog at the end of the day? Well, a few things. First of all, just the appearance of the formula. The visual simplicity it provides gives a clear roadmap of what the formula is doing, i.e. finding x̂. In turn, this clarity of purpose commits itself to memory quite easily. The Calculus formula, on the other hand, has weak associations with x̂, which does not so much as appear in the formula. There is no indication that a vector of solutions is sought, and it's difficult to parse which symbols represent variables, versus which symbols represent known values.
The Calculus formula gets us to the same place, but without being clear about its goals or operations.
Not only do we skip the cumbersome steps of taking partial derivatives, but computationally it's more efficient besides. A^{T}A and A^{T}b can be generated simultaneously, although often instructions point you to do A^{T}A and A^{T}b separately. This way you can do away with the step of finding partial derivatives, entirely.