predict and fit, using numpy

In this reading, we'll see how np.dot (often written with the @ operator) and np.linalg.solve relate to the predict and fit methods, respectively, of sklearn's LinearRegression.

Say we've seen a few houses sell recently, with the following characteristics (features) and prices (label):

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame([[2,1,1985,196.55],
                      [3,1,1998,260.56],
                      [4,3,2005,334.55],
                      [4,2,2020,349.6]],
                     columns=["beds", "baths", "year", "price"])
train
Out[1]:
beds baths year price
0 2 1 1985 196.55
1 3 1 1998 260.56
2 4 3 2005 334.55
3 4 2 2020 349.60

Can we fit a LinearRegression model to the above, then use it to predict prices for the following three houses that haven't sold yet?

In [2]:
live = pd.DataFrame([[2,2,1999],
                     [5,1,1950],
                     [3,4,2000]],
                    columns=["beds", "baths", "year"])
live
Out[2]:
beds baths year
0 2 2 1999
1 5 1 1950
2 3 4 2000
In [3]:
lr = LinearRegression()
lr.fit(train[["beds", "baths", "year"]], train["price"])
Out[3]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Predicting

In [4]:
lr.predict(live)
Out[4]:
array([229.93, 265.  , 293.9 ])

The above tells us that the model thinks the three houses will sell for $229.93K, $265K, and $293.9K, respectively. Underlying these predictions was a dot product based on some arrays calculated during fitting.

In [5]:
c = lr.coef_
c
Out[5]:
array([42.3 , 10.  ,  1.67])
In [6]:
b = lr.intercept_
b
Out[6]:
-3213.000000000003

Let's pull the features from the first row of the live data into an array too.

In [7]:
house = live.iloc[0].values
house
Out[7]:
array([   2,    2, 1999])

The array we pulled from lr is called coef_ because those numbers are meant to be coefficients on the features for each house.

In [8]:
c[0]*house[0] + c[1]*house[1]+ c[2]*house[2] + b
Out[8]:
229.9300000000003

That is the same amount lr.predict gave for the first house! Better yet, if we put the house features and coefficients in the right shapes, we can simplify the expression to a dot product.

In [9]:
house = house.reshape(1,-1)
c = c.reshape(-1,1)
np.dot(house, c) + b
Out[9]:
array([[229.93]])

Same thing as before! Or, using the @ operator, which (for 2-D arrays like these) is shorthand for np.dot:

In [10]:
house @ c + b
Out[10]:
array([[229.93]])

We've seen how to do row @ col. If we do matrix @ col, you can think of it as looping over each row in matrix, computing the dot product of each row with col, then stacking the results in the output. This means we can do all the predictions at once!
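
If it helps, here is a rough sketch of that mental model (not how numpy actually computes it; it just reuses the live, c, and b variables from the cells above): loop over the rows, take a dot product per row, and stack the results.

def matmul_by_rows(matrix, col):
    # compute row @ col for each row, then stack the results into one array
    rows = [np.dot(matrix[i], col) for i in range(len(matrix))]
    return np.array(rows)

matmul_by_rows(live.values, c) + b   # same result as live.values @ c + b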

In [11]:
live.values
Out[11]:
array([[   2,    2, 1999],
       [   5,    1, 1950],
       [   3,    4, 2000]])
In [12]:
live.values @ c + b
Out[12]:
array([[229.93],
       [265.  ],
       [293.9 ]])

Recall that these are the same values that LinearRegression predicted earlier -- it's just using the dot product internally:

In [13]:
lr.predict(live)
Out[13]:
array([229.93, 265.  , 293.9 ])

Training

Ok, how did fit determine what values to use in coef_ and intercept_? Let's think about how we could use X and y values extracted from our training data:

In [14]:
X = train.values[:, :-1]
X
Out[14]:
array([[2.000e+00, 1.000e+00, 1.985e+03],
       [3.000e+00, 1.000e+00, 1.998e+03],
       [4.000e+00, 3.000e+00, 2.005e+03],
       [4.000e+00, 2.000e+00, 2.020e+03]])
In [15]:
y = train.values[:, -1:]
y
Out[15]:
array([[196.55],
       [260.56],
       [334.55],
       [349.6 ]])

We know that for predictions, LinearRegression wants to use this:

X @ coef_ + intercept_ = y

coef_ is a vector and intercept_ is a single number; we can eliminate intercept_ as a separate variable if we add an entry to coef_ and add a column of ones to X.

In [16]:
X = np.concatenate((train.values[:, :-1], np.ones((len(train), 1))), axis=1)
X
Out[16]:
array([[2.000e+00, 1.000e+00, 1.985e+03, 1.000e+00],
       [3.000e+00, 1.000e+00, 1.998e+03, 1.000e+00],
       [4.000e+00, 3.000e+00, 2.005e+03, 1.000e+00],
       [4.000e+00, 2.000e+00, 2.020e+03, 1.000e+00]])

This gives us this simple equation:

X @ coef_ = y
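
As a quick check (a sketch reusing the lr and X objects from above), we can stack lr.intercept_ onto the end of lr.coef_ and confirm that the ones column now does the intercept's job:

# append the intercept as a 4th "coefficient"; the ones column pairs with it
c_with_b = np.concatenate((lr.coef_, [lr.intercept_])).reshape(-1, 1)
X @ c_with_b   # same values as lr.predict(train[["beds", "baths", "year"]])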

We know X and y (from the train DataFrame) -- can we use those to solve for coef_? If the dot product were regular multiplication, we could divide both sides by X, but division isn't defined for matrices and the dot product. Solving is a little trickier, but fortunately numpy can do it for us:

In [17]:
# Solves for c in X @ c = y, given X and y as arguments
c = np.linalg.solve(X, y)
list(c.reshape(-1))
Out[17]:
[42.29999999999925, 10.000000000000181, 1.6700000000000526, -3213.000000000103]

That contains the same coefficients and intercept that LinearRegression.fit found earlier:

In [18]:
print(list(lr.coef_), lr.intercept_)
[42.29999999999998, 9.999999999999984, 1.6700000000000017] -3213.000000000003
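
As a sanity check (a small sketch using the X, y, and c arrays above), we can confirm the solution reproduces the training prices. Multiplying both sides by the inverse of X -- the matrix analog of the "division" we wished for earlier -- gives the same answer, though np.linalg.solve is the preferred, more numerically robust route:

np.allclose(X @ c, y)      # True: the solved c reproduces the training prices
np.linalg.inv(X) @ y       # same coefficients, computed via the matrix inverse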

How does np.linalg.solve work? It is solving a system of N equations and N variables. It turns out it is possible to convert a table to such an algebra problem, converting each row to an equation and each column to a variable.

Of course, this means it only works for square tables (same number of rows as columns), such as the feature sub-table of train below, once we add a column of ones to make it 4x4:

In [19]:
train[["beds", "baths", "year"]]
Out[19]:
beds baths year
0 2 1 1985
1 3 1 1998
2 4 3 2005
3 4 2 2020
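
Concretely, with the ones column added, each of the four rows above becomes one equation in the four unknowns we solved for (the three coefficients plus the intercept, i.e. the entries of c from np.linalg.solve):

2*c[0] + 1*c[1] + 1985*c[2] + 1*c[3] = 196.55
3*c[0] + 1*c[1] + 1998*c[2] + 1*c[3] = 260.56
4*c[0] + 3*c[1] + 2005*c[2] + 1*c[3] = 334.55
4*c[0] + 2*c[1] + 2020*c[2] + 1*c[3] = 349.60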

One implication is that np.linalg.solve won't work for us if there are M rows and N columns, where M > N. That would mean solving a system of M equations with only N variables, and it is rarely possible to find an exact solution in such cases. In the near future, however, we'll learn how to find good solutions ("good" remains to be defined) for such systems of M equations and N variables. Of course, most of the tables you've worked with probably have more rows than columns, so this is a very important problem.
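
To see the square requirement in action (a small sketch using the train data above; the exact error message may vary across numpy versions), passing a matrix with more rows than columns to np.linalg.solve raises an error rather than returning coefficients:

X_tall = train.values[:, :-1]        # 4 rows, 3 columns (no ones column)
try:
    np.linalg.solve(X_tall, train.values[:, -1:])
except np.linalg.LinAlgError as e:
    print("np.linalg.solve failed:", e)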