\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" x0 x1 x2 y\n",
"0 2 -1 1 6.0\n",
"1 0 1 1 2.0\n",
"2 0 3 1 8.0\n",
"3 0 3 1 8.1"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"df = pd.DataFrame([[2, -1, 1, 6],\n",
" [0, 1, 1, 2],\n",
" [0, 3, 1, 8],\n",
" [0, 3, 1, 8.1],\n",
" ],\n",
" columns=[\"x0\", \"x1\", \"x2\", \"y\"])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to find some coefficients relating y to the x values, as follows:\n",
" \n",
"`x0*c0 + x1*c1 + x2*c2 = y`\n",
"\n",
"Each row of df gives us some x and y values we can plug into the above formula to get an equation. For example, row 0 gives us this:\n",
"\n",
"`2*c0 + -1*c1 + 1*c2 = 6`\n",
"\n",
"Let's ignore the fourth row for now. The first three rows give us the following equations:\n",
"\n",
"1. `2*c0 + -1*c1 + 1*c2 = 6`\n",
"2. `0*c0 + 1*c1 + 1*c2 = 2`\n",
"3. `0*c0 + 3*c1 + 1*c2 = 8`\n",
"\n",
"If we multiply both sides of equation 2 by 3, we get this:\n",
"\n",
"4. `0*c0 + 3*c1 + 3*c2 = 6`\n",
"\n",
"If we subract equation 4 away from equation 3, we get this:\n",
"\n",
"5. `0*c0 + 0*c1 + -2*c2 = 2`\n",
"\n",
"At this point, we can solve for `c2 = -1`. We can plug this value for c2 back into equations 1 and 2 to get the following:\n",
"\n",
"1. `2*c0 + -1*c1 + -1 = 6`\n",
"2. `0*c0 + 1*c1 + -1 = 2`\n",
"\n",
"Equation 2 only has one variable now, c1, and we can easily find that `c1 = 3`. Plugging that into equation 1, we further find `c0 = 5`.\n",
"\n",
"Solving for c given X and y in `X @ c = y` is the matrix way of expressing this system of equations, and `np.linalg.solve` can quickly find the same answer we found manually (note that the slicing excludes the last row):"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 5., 3., -1.])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.linalg.solve(df.iloc[:-1][[\"x0\", \"x1\", \"x2\"]], df.iloc[:-1][\"y\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Problem: most y columns lead to contradictions for DataFrames with more rows than columns\n",
"\n",
"Let's look at the last two rows of the original DataFrame again:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"x0 | x1 | x2 | y | |
---|---|---|---|---|

0 | 2 | -1 | 1 | 6.0 |

1 | 0 | 1 | 1 | 2.0 |

2 | 0 | 3 | 1 | 8.0 |

3 | 0 | 3 | 1 | 8.1 |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" x0 x1 x2 y\n",
"0 2 -1 1 6.0\n",
"1 0 1 1 2.0\n",
"2 0 3 1 8.0\n",
"3 0 3 1 8.1"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting the last two rows to equations, we get:\n",
"\n",
"1. `0*c0 + 3*c1 + 1*c2 = 8`\n",
"2. `0*c0 + 3*c1 + 1*c2 = 8.1`\n",
"\n",
"Of course, this is a contradiction, and there is no mathematically correct answer. But must we completely throw away the solution we found earlier? After all, maybe the 8.1 included a small measurement error. Having 4 equations and 3 variables would have been fine if the that fourth was somehow redundant with the earlier equations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution: modify the y column\n",
"\n",
"Our solution will be to modify the y column to turn this into a solveable problem. For the solution to our modified problem to be meaningful to our original problem, the modifications to y will need to be as small as possible.\n",
"\n",
"We'll call our modified y column p. We'll compute p as `p = P @ y`. `P` is what we call a projection matrix. A projection matrix has two properties. (1) When we multiply it by an vector using the dot product, we'll get back a vector that can be used in the y column and will lead to a solveable problem. (2) the vector we get back is as close to the original vector as possible.\n",
"\n",
"The values in a projection matrix P will depend on the X values we're working with. We won't derive the formula here (320 isn't a math class, after all), but it can be computed as follows:\n",
"\n",
"`P = X @ np.linalg.inv(X.T @ X) @ X.T`.\n",
"\n",
"`np.linalg.inv` computes the inverse of a matrix (matrix inverses are a topic covered in any linear algebra course, but we won't delve into it here). `.T` is transpose, which means a swap of rows and columns:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"x0 | x1 | x2 | y | |
---|---|---|---|---|

0 | 2 | -1 | 1 | 6.0 |

1 | 0 | 1 | 1 | 2.0 |

2 | 0 | 3 | 1 | 8.0 |

3 | 0 | 3 | 1 | 8.1 |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" 0 1 2 3\n",
"x0 2.0 0.0 0.0 0.0\n",
"x1 -1.0 1.0 3.0 3.0\n",
"x2 1.0 1.0 1.0 1.0\n",
"y 6.0 2.0 8.0 8.1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's write a function to compute a projection matrix:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1. , 0. , 0. , 0. ],\n",
" [0. , 1. , 0. , 0. ],\n",
" [0. , 0. , 0.5, 0.5],\n",
" [0. , 0. , 0.5, 0.5]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def projection_matrix(X):\n",
" return X @ np.linalg.inv(X.T @ X) @ X.T\n",
"\n",
"P = projection_matrix(df[[\"x0\", \"x1\", \"x2\"]].values)\n",
"P"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"0 | 1 | 2 | 3 | |
---|---|---|---|---|

x0 | 2.0 | 0.0 | 0.0 | 0.0 |

x1 | -1.0 | 1.0 | 3.0 | 3.0 |

x2 | 1.0 | 1.0 | 1.0 | 1.0 |

y | 6.0 | 2.0 | 8.0 | 8.1 |

\n",
"\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"

\n",
"

"
],
"text/plain": [
" x0 x1 x2 y p\n",
"0 2 -1 1 6.0 6.00\n",
"1 0 1 1 2.0 2.00\n",
"2 0 3 1 8.0 8.05\n",
"3 0 3 1 8.1 8.05"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"p\"] = P @ df[\"y\"]\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See what happened? The projection matrix turned this into a solveable problem! p is close to y, but whereas trying to solve `X @ c = y` lead to the following contradiction:\n",
"\n",
"* `0*c0 + 3*c1 + 1*c2 = 8`\n",
"* `0*c0 + 3*c1 + 1*c2 = 8.1`\n",
"\n",
"Now we get this:\n",
"\n",
"* `0*c0 + 3*c1 + 1*c2 = 8.05`\n",
"* `0*c0 + 3*c1 + 1*c2 = 8.05`\n",
"\n",
"Athough we now have a problem with a mathematical solution, `np.linalg.solve` still only works with square matrices (this is not a fundamental limitation -- you could imagine building a better version of `solve` that works for any mathematically solveable problem):"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last 2 dimensions of the array must be square\n"
]
}
],
"source": [
"try:\n",
" c = np.linalg.solve(df[[\"x0\", \"x1\", \"x2\"]], df[\"p\"])\n",
"except Exception as e:\n",
" print(e)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It turns out that the solution is simple, even though we don't derive the formula here. We originally wanted to solve (1) `c = np.linalg.solve(X, y)`, but it wasn't mathematicaly possible. We then tried to solve (2) `c = np.linalg.solve(X, p)`, which was mathematically solveable, but which made numpy unhappy. If we go back to expression (1), and multiply both X and y with X.T in front, we get this: (3) `c = np.linalg.solve(X.T@X, X.T@y)`; numpy usually accepts this, and the c we get from (3) happens to be the solution for (2). Another nice thing about this last formula is that it's relatively fast to compute relative to computing the projection matrix: we can find what c would be if we were to compute p, but without ever computing vector p (or matrix P!)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Least Squares\n",
"\n",
"How close is p to y? One way to measure is with sklearn's mean_squared_error metric."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0012499999999999911"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"mean_squared_error(df[\"y\"], df[\"p\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We won't prove it in this course, but it turns out that there can be no p that leads to a mathematically solveable problem that has a smaller `mean_squared_error` score. Thus, the kind of regression we've been doing with either sklearn's LinearRegression or equivalenly with the `np.linalg.solve(X.T@X, X.T@y)` expression is often called a *least squares regression*: https://en.wikipedia.org/wiki/Least_squares"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
x0 | x1 | x2 | y | p | |
---|---|---|---|---|---|

0 | 2 | -1 | 1 | 6.0 | 6.00 |

1 | 0 | 1 | 1 | 2.0 | 2.00 |

2 | 0 | 3 | 1 | 8.0 | 8.05 |

3 | 0 | 3 | 1 | 8.1 | 8.05 |