Solving for N Variables Given M Equations

You may have been exposed to the idea that you can usually solve for N variables when you have N equations. For regression, the coefficient we're trying to find for each column is a variable, and every row can be used to generate an equation. For tall, skinny tables, then, we have a problem: more equations than variables. Sometimes this is solvable (for example, maybe some rows are repeated, so that there are fewer unique equations than one might initially assume), but most often it is not. Our strategy: tweak the y values (as little as possible!) to create a solvable problem.

Algebra Review: Solving Multiple Equations

Consider the following DataFrame:

In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2, -1, 1, 6],
                   [0, 1, 1, 2],
                   [0, 3, 1, 8],
                   [0, 3, 1, 8.1],
                  ],
                  columns=["x0", "x1", "x2", "y"])
df
Out[1]:
x0 x1 x2 y
0 2 -1 1 6.0
1 0 1 1 2.0
2 0 3 1 8.0
3 0 3 1 8.1

We want to find some coefficients relating y to the x values, as follows:

x0*c0 + x1*c1 + x2*c2 = y

Each row of df gives us some x and y values we can plug into the above formula to get an equation. For example, row 0 gives us this:

2*c0 + -1*c1 + 1*c2 = 6
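
Here's a small loop (just an illustration, not part of the method itself) that prints the equation generated by each row of df:

for _, row in df.iterrows():
    print(f"{row['x0']:g}*c0 + {row['x1']:g}*c1 + {row['x2']:g}*c2 = {row['y']:g}")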

Let's ignore the fourth row for now. The first three rows give us the following equations:

  1. 2*c0 + -1*c1 + 1*c2 = 6
  2. 0*c0 + 1*c1 + 1*c2 = 2
  3. 0*c0 + 3*c1 + 1*c2 = 8

If we multiply both sides of equation 2 by 3, we get this:

  4. 0*c0 + 3*c1 + 3*c2 = 6

If we subtract equation 4 from equation 3, we get this:

  5. 0*c0 + 0*c1 + -2*c2 = 2

At this point, we can solve for c2 = -1. We can plug this value for c2 back into equations 1 and 2 to get the following:

  1. 2*c0 + -1*c1 + -1 = 6
  2. 0*c0 + 1*c1 + -1 = 2

Equation 2 only has one variable now, c1, and we can easily find that c1 = 3. Plugging that into equation 1, we further find c0 = 5.
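
As a quick sanity check (just a verification, not a step we'll need later), we can plug c = (5, 3, -1) back in and confirm that the first three rows reproduce their y values of 6, 2, and 8:

c = np.array([5, 3, -1])
df.iloc[:3][["x0", "x1", "x2"]] @ c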

Solving for c given X and y in X @ c = y is the matrix way of expressing this system of equations, and np.linalg.solve can quickly find the same answer we found manually (note that the slicing excludes the last row):

In [2]:
np.linalg.solve(df.iloc[:-1][["x0", "x1", "x2"]], df.iloc[:-1]["y"])
Out[2]:
array([ 5.,  3., -1.])

The Problem: most y columns lead to contradictions for DataFrames with more rows than columns

Let's look at the last two rows of the original DataFrame again:

In [3]:
df
Out[3]:
x0 x1 x2 y
0 2 -1 1 6.0
1 0 1 1 2.0
2 0 3 1 8.0
3 0 3 1 8.1

Converting the last two rows to equations, we get:

  1. 0*c0 + 3*c1 + 1*c2 = 8
  2. 0*c0 + 3*c1 + 1*c2 = 8.1

Of course, this is a contradiction, and there is no mathematically correct answer. But must we completely throw away the solution we found earlier? After all, maybe the 8.1 included a small measurement error. Having 4 equations and 3 variables would have been fine if that fourth equation were somehow redundant with the earlier ones.

Solution: modify the y column

Our solution will be to modify the y column to turn this into a solvable problem. For the solution to our modified problem to be meaningful to our original problem, the modifications to y will need to be as small as possible.

We'll call our modified y column p. We'll compute p as p = P @ y, where P is what we call a projection matrix. A projection matrix has two properties: (1) when we multiply it by a vector using the dot product, we get back a vector that can be used in the y column and will lead to a solvable problem; (2) the vector we get back is as close to the original vector as possible.

The values in a projection matrix P will depend on the X values we're working with. We won't derive the formula here (320 isn't a math class, after all), but it can be computed as follows:

P = X @ np.linalg.inv(X.T @ X) @ X.T

np.linalg.inv computes the inverse of a matrix (matrix inverses are a topic covered in any linear algebra course, but we won't delve into them here). .T is transpose, which swaps rows and columns:

In [4]:
df.T
Out[4]:
0 1 2 3
x0 2.0 0.0 0.0 0.0
x1 -1.0 1.0 3.0 3.0
x2 1.0 1.0 1.0 1.0
y 6.0 2.0 8.0 8.1

Let's write a function to compute a projection matrix:

In [5]:
def projection_matrix(X):
    # compute P using the formula given above
    return X @ np.linalg.inv(X.T @ X) @ X.T

P = projection_matrix(df[["x0", "x1", "x2"]].values)
P
Out[5]:
array([[1. , 0. , 0. , 0. ],
       [0. , 1. , 0. , 0. ],
       [0. , 0. , 0.5, 0.5],
       [0. , 0. , 0.5, 0.5]])
In [6]:
df["p"] = P @ df["y"]
df
Out[6]:
x0 x1 x2 y p
0 2 -1 1 6.0 6.00
1 0 1 1 2.0 2.00
2 0 3 1 8.0 8.05
3 0 3 1 8.1 8.05

See what happened? The projection matrix turned this into a solvable problem! p is close to y, but whereas trying to solve X @ c = y led to the following contradiction:

  • 0*c0 + 3*c1 + 1*c2 = 8
  • 0*c0 + 3*c1 + 1*c2 = 8.1

Now we get this:

  • 0*c0 + 3*c1 + 1*c2 = 8.05
  • 0*c0 + 3*c1 + 1*c2 = 8.05
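
To double-check that the modified problem really is solvable, we can solve the square system formed by the first three rows (using p in place of y), then verify that the fourth row's equation holds too (this check is just for illustration):

c = np.linalg.solve(df.iloc[:3][["x0", "x1", "x2"]], df.iloc[:3]["p"])
np.isclose(df.iloc[3][["x0", "x1", "x2"]] @ c, df.iloc[3]["p"])   # should be True: no contradiction anymore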

Although we now have a problem with a mathematical solution, np.linalg.solve still only works with square matrices (this is not a fundamental limitation -- you could imagine building a better version of solve that works for any mathematically solvable problem):

In [7]:
try:
    c = np.linalg.solve(df[["x0", "x1", "x2"]], df["p"])
except Exception as e:
    print(e)
Last 2 dimensions of the array must be square

It turns out that the solution is simple, even though we don't derive the formula here. We originally wanted to solve (1) c = np.linalg.solve(X, y), but it wasn't mathematically possible. We then tried to solve (2) c = np.linalg.solve(X, p), which was mathematically solvable, but which made numpy unhappy. If we go back to expression (1) and multiply both X and y by X.T in front, we get this: (3) c = np.linalg.solve(X.T@X, X.T@y); numpy usually accepts this, and the c we get from (3) happens to be the solution for (2). Another nice thing about this last formula is that it's much faster than computing the projection matrix: we can find what c would be if we were to compute p, but without ever computing the vector p (or the matrix P!).
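
Here's that shortcut applied to our example (the np.allclose check is just to convince ourselves that the c from formula (3) really does solve the modified problem (2)):

X = df[["x0", "x1", "x2"]].values
y = df["y"].values
c = np.linalg.solve(X.T @ X, X.T @ y)
np.allclose(X @ c, df["p"])   # should be True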

Least Squares

How close is p to y? One way to measure this is with sklearn's mean_squared_error metric.

In [8]:
from sklearn.metrics import mean_squared_error
mean_squared_error(df["y"], df["p"])
Out[8]:
0.0012499999999999911

We won't prove it in this course, but it turns out that there is no other p that leads to a mathematically solvable problem and has a smaller mean_squared_error score. Thus, the kind of regression we've been doing, with either sklearn's LinearRegression or equivalently with the np.linalg.solve(X.T@X, X.T@y) expression, is often called least squares regression: https://en.wikipedia.org/wiki/Least_squares
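
To see the equivalence concretely, here's a comparison of the two approaches on our example. Note that fit_intercept=False is used only because our X already includes the all-ones column x2; with the default settings, LinearRegression would add its own intercept column.

from sklearn.linear_model import LinearRegression

X = df[["x0", "x1", "x2"]].values
y = df["y"].values
lr = LinearRegression(fit_intercept=False).fit(X, y)
np.allclose(lr.coef_, np.linalg.solve(X.T @ X, X.T @ y))   # should be True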