Data Science from Scratch: First Principles with Python (2015)

Chapter 14.Simple Linear Regression

Art, like morality, consists in drawing the line somewhere.

G. K. Chesterton

InChapter 5, we used thecorrelationfunction tomeasure the strength of the linear relationship between two variables. For most applications, knowing that such a linear relationship exists isn’t enough. We’ll want to be able to understand the nature of the relationship. This is where we’ll use simple linear regression.

The Model

Recall that we were investigating the relationship between a DataSciencester user’s number of friends and the amountof time he spent on the site each day.Let’s assume that you’ve convinced yourself that having more friendscausespeople to spend more time on the site, rather than one of the alternative explanations we discussed.

The VP of Engagement asks you to build a model describing this relationship. Since you found a pretty strong linear relationship, a natural place to start is a linear model.

In particular, you hypothesize that there are constants(alpha) and(beta) such that:

whereis the number of minutes userispends on the site daily,is the number of friends userihas, andis a (hopefully small) error term representing the fact that there are other factors not accounted for by this simple model.

Assuming we’ve determined such analphaandbeta, then we make predictions simply with:

defpredict(alpha,beta,x_i):

returnbeta*x_i+alpha

How do we choosealphaandbeta? Well, any choice ofalphaandbetagives us a predicted output for each inputx_i. Since we know the actual outputy_iwe can compute the error for each pair:

deferror(alpha,beta,x_i,y_i):

"""the error from predicting beta * x_i + alpha

when the actual value is y_i"""

returny_i-predict(alpha,beta,x_i)

What we’d really like to know is the total error over the entire data set.But we don’t want to just add the errors — if the prediction forx_1is too high and the prediction forx_2is too low, the errors may just cancel out.

So instead we add up thesquarederrors:

defsum_of_squared_errors(alpha,beta,x,y):

returnsum(error(alpha,beta,x_i,y_i)**2

forx_i,y_iinzip(x,y))

Theleast squares solutionisto choose thealphaandbetathat makesum_of_squared_errorsas small as possible.

Using calculus (or tedious algebra), the error-minimizing alpha and beta are given by:

defleast_squares_fit(x,y):

"""given training values for x and y,

find the least-squares values of alpha and beta"""

Without going through the exact mathematics, let’s think about why this might be a reasonable solution. The choice ofalphasimply says that when we see the average value of the independent variablex, we predict the average value of the dependent variabley.

The choice ofbetameans that when the input value increases bystandard_deviation(x), the prediction increases bycorrelation(x, y) * standard_deviation(y).In the case whenxandyare perfectly correlated, a one standard deviation increase inxresults in a one-standard-deviation-of-yincrease in the prediction. When they’re perfectly anticorrelated, the increase inxresults in adecreasein the prediction. And when the correlation is zero,betais zero, which means that changes inxdon’t affect the prediction at all.

It’s easy to apply this to the outlierless data fromChapter 5:

This gives values of alpha = 22.95 and beta = 0.903. So our model says that we expect a user withnfriends to spend22.95 + n * 0.903minutes on the site each day. That is, we predict that a user with no friends on DataSciencester would still spend about 23 minutes a day on the site. And for each additional friend, we expect a user to spend almost a minute more on the site each day.

InFigure 14-1, we plot the prediction line to get a sense of how well the model fits the observed data.

Of course, we need a better way to figure out how well we’ve fit the data than staring at the graph. A common measureis thecoefficient of determination(orR-squared), which measures the fraction of the total variation in the dependent variable that is captured by the model:

deftotal_sum_of_squares(y):

"""the total squared variation of y_i's from their mean"""

returnsum(v**2forvinde_mean(y))

defr_squared(alpha,beta,x,y):

"""the fraction of variation in y captured by the model, which equals

1 - the fraction of variation in y not captured by the model"""

Now, we chose thealphaandbetathat minimized the sum of the squared prediction errors. One linear model we could have chosen is “always predictmean(y)” (corresponding toalpha = mean(y)andbeta = 0), whose sum of squared errors exactly equals its total sum of squares. This means an R-squared of zero, which indicates a model that (obviously, in this case) performs no better than just predicting the mean.

Clearly, the least squares model must be at least as good as that one, which means that the sum of the squared errors isat mostthe total sum of squares, which means that the R-squared must be at least zero. And the sum of squared errors must be at least 0, which means that the R-squared can be at most 1.

The higher the number, the better our model fits the data. Here we calculate an R-squared of 0.329, which tells us that our model is only sort of okay at fitting the data, and that clearly there are other factors at play.

Using Gradient Descent

If we writetheta = [alpha, beta], then we canalso solve this using gradient descent:

Using the same data we get alpha = 22.93, beta = 0.905, which are very close to the exact answers.

Maximum Likelihood Estimation

Why choose least squares? One justification involvesmaximum likelihood estimation.

Imagine that we have a sample of datathat comes from a distribution that depends on some unknown parameter:

If we didn’t know theta, wecould turn around and think of this quantity as thelikelihoodofgiven the sample:

Under this approach, the most likelyis the value that maximizes this likelihood function; that is, the value that makes the observed data the most probable. In the case of a continuous distribution, in which we have a probability distribution function rather than a probability mass function, we can do the same thing.

Back to regression. One assumption that’s often made about the simple regression model is that the regression errorsare normally distributed with mean 0 and some (known) standard deviation. If that’s the case, then the likelihood based on seeing a pair(x_i, y_i)is:

The likelihood based on the entire data set is the product of the individual likelihoods, which is largest precisely whenalphaandbetaare chosen to minimize the sum of squared errors. That is, in this case (and with these assumptions), minimizing the sum of squared errors is equivalent to maximizing the likelihood of the observed data.

For Further Exploration

Continue reading about multiple regression inChapter 15!