{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Solving for N Variables Given M Equations\n", "\n", "You may be been exposed to the idea that you can usually solve for N variables when you have N equations. For regression, the coefficient we're trying to find for each column is a variable, and every row can be used to generate an equation. For tall skinny tables, then, we have a problem: more equations than variables. Sometimes this is solveable (for example, maybe some rows are repeated, such that there are fewer unique equations than one may initially assume), but most often it is not. Our strategy: tweak the y values (as little as possible!) to create a solveable problem.\n", "\n", "## Algebra Review: Solving Multiple Equations\n", "\n", "Consider the following DataFrame:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x0x1x2y
02-116.0
10112.0
20318.0
30318.1
\n", "
" ], "text/plain": [ " x0 x1 x2 y\n", "0 2 -1 1 6.0\n", "1 0 1 1 2.0\n", "2 0 3 1 8.0\n", "3 0 3 1 8.1" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "df = pd.DataFrame([[2, -1, 1, 6],\n", " [0, 1, 1, 2],\n", " [0, 3, 1, 8],\n", " [0, 3, 1, 8.1],\n", " ],\n", " columns=[\"x0\", \"x1\", \"x2\", \"y\"])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to find some coefficients relating y to the x values, as follows:\n", " \n", "`x0*c0 + x1*c1 + x2*c2 = y`\n", "\n", "Each row of df gives us some x and y values we can plug into the above formula to get an equation. For example, row 0 gives us this:\n", "\n", "`2*c0 + -1*c1 + 1*c2 = 6`\n", "\n", "Let's ignore the fourth row for now. The first three rows give us the following equations:\n", "\n", "1. `2*c0 + -1*c1 + 1*c2 = 6`\n", "2. `0*c0 + 1*c1 + 1*c2 = 2`\n", "3. `0*c0 + 3*c1 + 1*c2 = 8`\n", "\n", "If we multiply both sides of equation 2 by 3, we get this:\n", "\n", "4. `0*c0 + 3*c1 + 3*c2 = 6`\n", "\n", "If we subract equation 4 away from equation 3, we get this:\n", "\n", "5. `0*c0 + 0*c1 + -2*c2 = 2`\n", "\n", "At this point, we can solve for `c2 = -1`. We can plug this value for c2 back into equations 1 and 2 to get the following:\n", "\n", "1. `2*c0 + -1*c1 + -1 = 6`\n", "2. `0*c0 + 1*c1 + -1 = 2`\n", "\n", "Equation 2 only has one variable now, c1, and we can easily find that `c1 = 3`. Plugging that into equation 1, we further find `c0 = 5`.\n", "\n", "Solving for c given X and y in `X @ c = y` is the matrix way of expressing this system of equations, and `np.linalg.solve` can quickly find the same answer we found manually (note that the slicing excludes the last row):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 5., 3., -1.])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.linalg.solve(df.iloc[:-1][[\"x0\", \"x1\", \"x2\"]], df.iloc[:-1][\"y\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Problem: most y columns lead to contradictions for DataFrames with more rows than columns\n", "\n", "Let's look at the last two rows of the original DataFrame again:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x0x1x2y
02-116.0
10112.0
20318.0
30318.1
\n", "
" ], "text/plain": [ " x0 x1 x2 y\n", "0 2 -1 1 6.0\n", "1 0 1 1 2.0\n", "2 0 3 1 8.0\n", "3 0 3 1 8.1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting the last two rows to equations, we get:\n", "\n", "1. `0*c0 + 3*c1 + 1*c2 = 8`\n", "2. `0*c0 + 3*c1 + 1*c2 = 8.1`\n", "\n", "Of course, this is a contradiction, and there is no mathematically correct answer. But must we completely throw away the solution we found earlier? After all, maybe the 8.1 included a small measurement error. Having 4 equations and 3 variables would have been fine if the that fourth was somehow redundant with the earlier equations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution: modify the y column\n", "\n", "Our solution will be to modify the y column to turn this into a solveable problem. For the solution to our modified problem to be meaningful to our original problem, the modifications to y will need to be as small as possible.\n", "\n", "We'll call our modified y column p. We'll compute p as `p = P @ y`. `P` is what we call a projection matrix. A projection matrix has two properties. (1) When we multiply it by an vector using the dot product, we'll get back a vector that can be used in the y column and will lead to a solveable problem. (2) the vector we get back is as close to the original vector as possible.\n", "\n", "The values in a projection matrix P will depend on the X values we're working with. We won't derive the formula here (320 isn't a math class, after all), but it can be computed as follows:\n", "\n", "`P = X @ np.linalg.inv(X.T @ X) @ X.T`.\n", "\n", "`np.linalg.inv` computes the inverse of a matrix (matrix inverses are a topic covered in any linear algebra course, but we won't delve into it here). `.T` is transpose, which means a swap of rows and columns:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
x02.00.00.00.0
x1-1.01.03.03.0
x21.01.01.01.0
y6.02.08.08.1
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "x0 2.0 0.0 0.0 0.0\n", "x1 -1.0 1.0 3.0 3.0\n", "x2 1.0 1.0 1.0 1.0\n", "y 6.0 2.0 8.0 8.1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write a function to compute a projection matrix:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1. , 0. , 0. , 0. ],\n", " [0. , 1. , 0. , 0. ],\n", " [0. , 0. , 0.5, 0.5],\n", " [0. , 0. , 0.5, 0.5]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def projection_matrix(X):\n", " return X @ np.linalg.inv(X.T @ X) @ X.T\n", "\n", "P = projection_matrix(df[[\"x0\", \"x1\", \"x2\"]].values)\n", "P" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x0x1x2yp
02-116.06.00
10112.02.00
20318.08.05
30318.18.05
\n", "