{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# predict and fit, using numpy\n", "\n", "In this reading, we'll see how `np.dot` (often expressed with the `@` operator) and `np.linalg.solve` relate to `predict` and `fit` respectively for sklearn's LinearRegression. \n", "\n", "Say we've seen a few houses sell recently, with the following characteristics (features) and prices (label):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " beds baths year price\n", "0 2 1 1985 196.55\n", "1 3 1 1998 260.56\n", "2 4 3 2005 334.55\n", "3 4 2 2020 349.60" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.linear_model import LinearRegression\n", "\n", "train = pd.DataFrame([[2,1,1985,196.55],\n", " [3,1,1998,260.56],\n", " [4,3,2005,334.55],\n", " [4,2,2020,349.6]],\n", " columns=[\"beds\", \"baths\", \"year\", \"price\"])\n", "train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we fit a LinearRegression model to the above, then use it to predict prices for the following three houses that haven't sold yet?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " beds baths year\n", "0 2 2 1999\n", "1 5 1 1950\n", "2 3 4 2000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "live = pd.DataFrame([[2,2,1999],\n", " [5,1,1950],\n", " [3,4,2000]],\n", " columns=[\"beds\", \"baths\", \"year\"])\n", "live" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr = LinearRegression()\n", "lr.fit(train[[\"beds\", \"baths\", \"year\"]], train[\"price\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([229.93, 265. , 293.9 ])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr.predict(live)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above tells us that the model thinks the three houses will sell for \\\\$229.93K, \\\\$265K, and \\\\$293.9K, respectively. Underlying this prediction was a dot product based on some arrays calculated during fitting." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([42.3 , 10. , 1.67])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = lr.coef_\n", "c" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-3213.000000000003" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = lr.intercept_\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's pull the features from the first row of the live data into an array too." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 2, 2, 1999])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "house = live.iloc[0].values\n", "house" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The array we pulled from `lr` is called `coef_` because those numbers are the coefficients applied to each house's features: each feature is multiplied by its coefficient, the products are summed, and the intercept is added." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "229.9300000000003" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c[0]*house[0] + c[1]*house[1] + c[2]*house[2] + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That was the same amount predicted by `lr.predict` for the first house! Better yet, if we reshape `house` into a row and `c` into a column, we can simplify the expression to a dot product." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[229.93]])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "house = house.reshape(1,-1)\n", "c = c.reshape(-1,1)\n", "np.dot(house, c) + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Same thing as before! Or using the `@` operator, which for 2-D arrays like these computes the same thing as `np.dot`:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[229.93]])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "house @ c + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've seen how to do `row @ col`. If we do `matrix @ col`, you can think of it as looping over each row in `matrix`, computing the dot product of each row with `col`, then stacking the results in the output. This means we can do all the predictions at once!" 
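, "\n", "\n", "A minimal sketch of that row-by-row view, assuming the rounded coefficient and intercept values printed above are typed in by hand rather than read from `lr` (the names `looped` and `at_once` are hypothetical, chosen only for this illustration):\n", "\n", "
```python
import numpy as np

# Features of the three unsold houses (beds, baths, year), copied from the text
live = np.array([[2, 2, 1999],
                 [5, 1, 1950],
                 [3, 4, 2000]])

# Rounded coefficients and intercept as printed earlier (assumed values)
c = np.array([42.3, 10.0, 1.67]).reshape(-1, 1)
b = -3213.0

# Loop version: one row @ col dot product per house, stacked into a column
looped = np.vstack([row.reshape(1, -1) @ c + b for row in live])

# Matrix version: every row handled by a single matrix @ col expression
at_once = live @ c + b

# Compare the two results, allowing for floating-point rounding
print(np.allclose(looped, at_once))
```
", "\n", "Both versions should produce the same 3x1 column of predictions; the matrix form is simply the loop written as a single expression."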
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 2, 2, 1999],\n", " [ 5, 1, 1950],\n", " [ 3, 4, 2000]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "live.values" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[229.93],\n", " [265. ],\n", " [293.9 ]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "live.values @ c + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that these are the same values that LinearRegression predicted earlier -- it's just using the dot product internally:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([229.93, 265. , 293.9 ])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr.predict(live)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Training\n", "\n", "Ok, how did `fit` determine what values to use in `coef_` and `intercept_`? 
Let's think about how we could use `X` and `y` values extracted from our training data:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[2.000e+00, 1.000e+00, 1.985e+03],\n", " [3.000e+00, 1.000e+00, 1.998e+03],\n", " [4.000e+00, 3.000e+00, 2.005e+03],\n", " [4.000e+00, 2.000e+00, 2.020e+03]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = train.values[:, :-1]\n", "X" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[196.55],\n", " [260.56],\n", " [334.55],\n", " [349.6 ]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = train.values[:, -1:]\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know that for predictions, `LinearRegression` uses this relationship:\n", "\n", "`X @ coef_ + intercept_ = y`\n", "\n", "`coef_` is a vector and `intercept_` is a single number; we can eliminate `intercept_` as a separate variable if we add an entry to `coef_` and add a column of ones to `X`. The new entry is always multiplied by 1, so it plays the role of the intercept." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[2.000e+00, 1.000e+00, 1.985e+03, 1.000e+00],\n", " [3.000e+00, 1.000e+00, 1.998e+03, 1.000e+00],\n", " [4.000e+00, 3.000e+00, 2.005e+03, 1.000e+00],\n", " [4.000e+00, 2.000e+00, 2.020e+03, 1.000e+00]])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = np.concatenate((train.values[:, :-1], np.ones((len(train), 1))), axis=1)\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us a simpler equation:\n", "\n", "`X @ coef_ = y`\n", "\n", "We know `X` and `y` (from the `train` DataFrame) -- can we use those to solve for `coef_`? 
If the dot product were a regular multiplication, we would divide both sides by `X`, but division isn't defined for matrices and the dot product. Solving is a little trickier, but fortunately numpy can do it for us:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[42.29999999999925, 10.000000000000181, 1.6700000000000526, -3213.000000000103]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Solves for c in X @ c = y, given X and y as arguments\n", "c = np.linalg.solve(X, y)\n", "list(c.reshape(-1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Up to floating-point rounding, that contains the same coefficients and intercept that `LinearRegression.fit` found earlier:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[42.29999999999998, 9.999999999999984, 1.6700000000000017] -3213.000000000003\n" ] } ], "source": [ "print(list(lr.coef_), lr.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does `np.linalg.solve` work? It solves a system of N equations in N variables. A table can be converted to such an algebra problem by turning each row into an equation and each column into a variable. For example, the first row of `X` and `y` becomes `2*c0 + 1*c1 + 1985*c2 + 1*c3 = 196.55`, where `c0` through `c3` are the unknown entries of `c`.\n", "\n", "Of course, this means it only works for square tables (same number of rows and columns), such as the sub-table of `train` that contains features, if we were to add a column of ones:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " beds baths year\n", "0 2 1 1985\n", "1 3 1 1998\n", "2 4 3 2005\n", "3 4 2 2020" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train[[\"beds\", \"baths\", \"year\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One implication is that `np.linalg.solve` won't work for us if there are M rows and N columns, where M > N. This would be solving a system of M equations with only N variables. It is rarely possible to solve for **correct** solutions in such cases. In the near future, however, we'll learn how to solve for **good** solutions (\"good\" remains to be defined) for systems of M equations and N variables. Of course, most of the tables you've worked with probably have more rows than columns, so this is a very important problem." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }