Data Science from Scratch: First Principles with Python (2015)

Chapter 15. Multiple Regression

I don’t look at a problem and put variables in there that don’t affect it.

Bill Parcells

Although the VP is pretty impressed with your predictive model, she thinks you can do better. To that end, you’ve collected additional data: for each of your users, you know how many hours he works each day, and whether he has a PhD. You’d like to use this additional data to improve your model.

Accordingly, you hypothesize a linear model with more independent variables:

Obviously, whether a user has a PhD is not a number, but — as we mentioned in Chapter 11 — we can introduce a dummy variable that equals 1 for users with PhDs and 0 for users without, after which it’s just as numeric as the other variables.

The Model

Recall that in Chapter 14 we fit a model of the form:

Now imagine that each input  is not a single number but rather a vector of k numbers . The multiple regression model assumes that:

In multiple regression the vector of parameters is usually called . We’ll want this to include the constant term as well, which we can achieve by adding a column of ones to our data:

beta = [alpha, beta_1, ..., beta_k]

and:

x_i = [1, x_i1, ..., x_ik]

Then our model is just:

def predict(x_i, beta):
    """assumes that the first element of each x_i is 1"""
    return dot(x_i, beta)

In this particular case, our independent variable x will be a list of vectors, each of which looks like this:

[1,    # constant term
 49,   # number of friends
 4,    # work hours per day
 0]    # doesn't have PhD

Further Assumptions of the Least Squares Model

There are a couple of further assumptions that are required for this model (and our solution) to make sense.

The first is that the columns of x are linearly independent — that there’s no way to write any one as a weighted sum of some of the others. If this assumption fails, it’s impossible to estimate beta. To see this in an extreme case, imagine we had an extra field num_acquaintances in our data that for every user was exactly equal to num_friends.

Then, starting with any beta, if we add any amount to the num_friends coefficient and subtract that same amount from the num_acquaintances coefficient, the model’s predictions will remain unchanged. Which means that there’s no way to find the coefficient for num_friends. (Usually violations of this assumption won’t be so obvious.)

The second important assumption is that the columns of x are all uncorrelated with the errors . If this fails to be the case, our estimates of beta will be systematically wrong.

For instance, in Chapter 14, we built a model that predicted that each additional friend was associated with an extra 0.90 daily minutes on the site.

Imagine that it’s also the case that:

§  People who work more hours spend less time on the site.

§  People with more friends tend to work more hours.

That is, imagine that the “actual” model is:

and that work hours and friends are positively correlated. In that case, when we minimize the errors of the single variable model:

we will underestimate .

Think about what would happen if we made predictions using the single variable model with the “actual” value of . (That is, the value that arises from minimizing the errors of what we called the “actual” model.) The predictions would tend to be too small for users who work many hours and too large for users who work few hours, because  and we “forgot” to include it. Because work hours is positively correlated with number of friends, this means the predictions tend to be too small for users with many friends and too large for users with few friends.

The result of this is that we can reduce the errors (in the single-variable model) by decreasing our estimate of , which means that the error-minimizing  is smaller than the “actual” value. That is, in this case the single-variable least squares solution is biased to underestimate . And, in general, whenever the independent variables are correlated with the errors like this, our least squares solution will give us a biased estimate of .

Fitting the Model

As we did in the simple linear model, we’ll choose beta to minimize the sum of squared errors. Finding an exact solution is not simple to do by hand, which means we’ll need to use gradient descent. We’ll start by creating an error function to minimize. For stochastic gradient descent, we’ll just want the squared error corresponding to a single prediction:

def error(x_i, y_i, beta):
    return y_i - predict(x_i, beta)
 
def squared_error(x_i, y_i, beta):
    return error(x_i, y_i, beta) ** 2

If you know calculus, you can compute:

def squared_error_gradient(x_i, y_i, beta):
    """the gradient (with respect to beta)
    corresponding to the ith squared error term"""
    return [-2 * x_ij * error(x_i, y_i, beta)
            for x_ij in x_i]

Otherwise, you’ll need to take my word for it.

At this point, we’re ready to find the optimal beta using stochastic gradient descent:

def estimate_beta(x, y):
    beta_initial = [random.random() for x_i in x[0]]
    return minimize_stochastic(squared_error,
                               squared_error_gradient,
                               x, y,
                               beta_initial,
                               0.001)
 
random.seed(0)
beta = estimate_beta(x, daily_minutes_good) # [30.63, 0.972, -1.868, 0.911]

This means our model looks like:

 

Interpreting the Model

You should think of the coefficients of the model as representing all-else-being-equal estimates of the impacts of each factor. All else being equal, each additional friend corresponds to an extra minute spent on the site each day. All else being equal, each additional hour in a user’s workday corresponds to about two fewer minutes spent on the site each day. All else being equal, having a PhD is associated with spending an extra minute on the site each day.

What this doesn’t (directly) tell us is anything about the interactions among the variables. It’s possible that the effect of work hours is different for people with many friends than it is for people with few friends. This model doesn’t capture that. One way to handle this case is to introduce a new variable that is the product of “friends” and “work hours.” This effectively allows the “work hours” coefficient to increase (or decrease) as the number of friends increases.

Or it’s possible that the more friends you have, the more time you spend on the site up to a point, after which further friends cause you to spend less time on the site. (Perhaps with too many friends the experience is just too overwhelming?) We could try to capture this in our model by adding another variable that’s the square of the number of friends.

Once we start adding variables, we need to worry about whether their coefficients “matter.” There are no limits to the numbers of products, logs, squares, and higher powers we could add.

Goodness of Fit

Again we can look at the R-squared, which has now increased to 0.68:

def multiple_r_squared(x, y, beta):
    sum_of_squared_errors = sum(error(x_i, y_i, beta) ** 2
                                for x_i, y_i in zip(x, y))
    return 1.0 - sum_of_squared_errors / total_sum_of_squares(y)

Keep in mind, however, that adding new variables to a regression will necessarily increase the R-squared. After all, the simple regression model is just the special case of the multiple regression model where the coefficients on “work hours” and “PhD” both equal 0. The optimal multiple regression model will necessarily have an error at least as small as that one.

Because of this, in a multiple regression, we also need to look at the standard errors of the coefficients, which measure how certain we are about our estimates of each . The regression as a whole may fit our data very well, but if some of the independent variables are correlated (or irrelevant), their coefficients might not mean much.

The typical approach to measuring these errors starts with another assumption — that the errors  are independent normal random variables with mean 0 and some shared (unknown) standard deviation . In that case, we (or, more likely, our statistical software) can use some linear algebra to find the standard error of each coefficient. The larger it is, the less sure our model is about that coefficient. Unfortunately, we’re not set up to do that kind of linear algebra from scratch.

Digression: The Bootstrap

Imagine we have a sample of n data points, generated by some (unknown to us) distribution:

data = get_sample(num_points=n)

In Chapter 5, we wrote a function to compute the median of the observed data, which we can use as an estimate of the median of the distribution itself.

But how confident can we be about our estimate? If all the data in the sample are very close to 100, then it seems likely that the actual median is close to 100. If approximately half the data in the sample is close to 0 and the other half is close to 200, then we can’t be nearly as certain about the median.

If we could repeatedly get new samples, we could compute the median of each and look at the distribution of those medians. Usually we can’t. What we can do instead is bootstrap new data sets by choosing n data points with replacement from our data and then compute the medians of those synthetic data sets:

def bootstrap_sample(data):
    """randomly samples len(data) elements with replacement"""
    return [random.choice(data) for _ in data]
 
def bootstrap_statistic(data, stats_fn, num_samples):
    """evaluates stats_fn on num_samples bootstrap samples from data"""
    return [stats_fn(bootstrap_sample(data))
            for _ in range(num_samples)]

For example, consider the two following data sets:

# 101 points all very close to 100
close_to_100 = [99.5 + random.random() for _ in range(101)]
 
# 101 points, 50 of them near 0, 50 of them near 200
far_from_100 = ([99.5 + random.random()] +
                [random.random() for _ in range(50)] +
                [200 + random.random() for _ in range(50)])

If you compute the median of each, both will be very close to 100. However, if you look at:

bootstrap_statistic(close_to_100, median, 100)

you will mostly see numbers really close to 100. Whereas if you look at:

bootstrap_statistic(far_from_100, median, 100)

you will see a lot of numbers close to 0 and a lot of numbers close to 200.

The standard_deviation of the first set of medians is close to 0, while the standard_deviation of the second set of medians is close to 100. (This extreme a case would be pretty easy to figure out by manually inspecting the data, but in general that won’t be true.)

Standard Errors of Regression Coefficients

We can take the same approach to estimating the standard errors of our regression coefficients. We repeatedly take a bootstrap_sample of our data and estimate beta based on that sample. If the coefficient corresponding to one of the independent variables (say num_friends) doesn’t vary much across samples, then we can be confident that our estimate is relatively tight. If the coefficient varies greatly across samples, then we can’t be at all confident in our estimate.

The only subtlety is that, before sampling, we’ll need to zip our x data and y data to make sure that corresponding values of the independent and dependent variables are sampled together. This means that bootstrap_sample will return a list of pairs (x_i, y_i), which we’ll need to reassemble into an x_sample and a y_sample:

def estimate_sample_beta(sample):
    """sample is a list of pairs (x_i, y_i)"""
    x_sample, y_sample = zip(*sample) # magic unzipping trick
    return estimate_beta(x_sample, y_sample)
 
random.seed(0) # so that you get the same results as me
 
bootstrap_betas = bootstrap_statistic(zip(x, daily_minutes_good),
                                      estimate_sample_beta,
                                      100)

After which we can estimate the standard deviation of each coefficient:

bootstrap_standard_errors = [
    standard_deviation([beta[i] for beta in bootstrap_betas])
    for i in range(4)]
 
# [1.174,    # constant term, actual error = 1.19
#  0.079,    # num_friends,   actual error = 0.080
#  0.131,    # unemployed,    actual error = 0.127
#  0.990]    # phd,           actual error = 0.998

We can use these to test hypotheses such as “does  equal zero?” Under the null hypothesis  (and with our other assumptions about the distribution of  ) the statistic:

which is our estimate of  divided by our estimate of its standard error, follows a Student’s t-distribution with “ degrees of freedom.”

If we had a students_t_cdf function, we could compute p-values for each least-squares coefficient to indicate how likely we would be to observe such a value if the actual coefficient were zero. Unfortunately, we don’t have such a function. (Although we would if we weren’t working from scratch.)

However, as the degrees of freedom get large, the t-distribution gets closer and closer to a standard normal. In a situation like this, where n is much larger than k, we can use normal_cdf and still feel good about ourselves:

def p_value(beta_hat_j, sigma_hat_j):
    if beta_hat_j > 0:
        # if the coefficient is positive, we need to compute twice the
        # probability of seeing an even *larger* value
        return 2 * (1 - normal_cdf(beta_hat_j / sigma_hat_j))
    else:
        # otherwise twice the probability of seeing a *smaller* value
        return 2 * normal_cdf(beta_hat_j / sigma_hat_j)
 
p_value(30.63, 1.174)    # ~0   (constant term)
p_value(0.972, 0.079)    # ~0   (num_friends)
p_value(-1.868, 0.131)   # ~0   (work_hours)
p_value(0.911, 0.990)    # 0.36 (phd)

(In a situation not like this, we would probably be using statistical software that knows how to compute the t-distribution, as well as how to compute the exact standard errors.)

While most of the coefficients have very small p-values (suggesting that they are indeed nonzero), the coefficient for “PhD” is not “significantly” different from zero, which makes it likely that the coefficient for “PhD” is random rather than meaningful.

In more elaborate regression scenarios, you sometimes want to test more elaborate hypotheses about the data, such as “at least one of the  is non-zero” or “ equals  and  equals ,” which you can do with an F-test, which, alas, falls outside the scope of this book.

 

Regularization

In practice, you’d often like to apply linear regression to data sets with large numbers of variables. This creates a couple of extra wrinkles. First, the more variables you use, the more likely you are to overfit your model to the training set. And second, the more nonzero coefficients you have, the harder it is to make sense of them. If the goal is to explain some phenomenon, a sparse model with three factors might be more useful than a slightly better model with hundreds.

Regularization is an approach in which we add to the error term a penalty that gets larger as beta gets larger. We then minimize the combined error and penalty. The more importance we place on the penalty term, the more we discourage large coefficients.

For example, in ridge regression, we add a penalty proportional to the sum of the squares of the beta_i. (Except that typically we don’t penalize beta_0, the constant term.)

# alpha is a *hyperparameter* controlling how harsh the penalty is
# sometimes it's called "lambda" but that already means something in Python
def ridge_penalty(beta, alpha):
  return alpha * dot(beta[1:], beta[1:])
 
def squared_error_ridge(x_i, y_i, beta, alpha):
    """estimate error plus ridge penalty on beta"""
    return error(x_i, y_i, beta) ** 2 + ridge_penalty(beta, alpha)

which you can then plug into gradient descent in the usual way:

def ridge_penalty_gradient(beta, alpha):
    """gradient of just the ridge penalty"""
    return [0] + [2 * alpha * beta_j for beta_j in beta[1:]]
 
def squared_error_ridge_gradient(x_i, y_i, beta, alpha):
    """the gradient corresponding to the ith squared error term
    including the ridge penalty"""
    return vector_add(squared_error_gradient(x_i, y_i, beta),
                      ridge_penalty_gradient(beta, alpha))
 
def estimate_beta_ridge(x, y, alpha):
    """use gradient descent to fit a ridge regression
    with penalty alpha"""
    beta_initial = [random.random() for x_i in x[0]]
    return minimize_stochastic(partial(squared_error_ridge, alpha=alpha),
                               partial(squared_error_ridge_gradient,
                                       alpha=alpha),
                               x, y,
                               beta_initial,
                               0.001)

With alpha set to zero, there’s no penalty at all and we get the same results as before:

random.seed(0)
beta_0 = estimate_beta_ridge(x, daily_minutes_good, alpha=0.0)
# [30.6, 0.97, -1.87, 0.91]
dot(beta_0[1:], beta_0[1:]) # 5.26
multiple_r_squared(x, daily_minutes_good, beta_0) # 0.680

As we increase alpha, the goodness of fit gets worse, but the size of beta gets smaller:

beta_0_01 = estimate_beta_ridge(x, daily_minutes_good, alpha=0.01)
# [30.6, 0.97, -1.86, 0.89]
dot(beta_0_01[1:], beta_0_01[1:])  # 5.19
multiple_r_squared(x, daily_minutes_good, beta_0_01)  # 0.680
 
beta_0_1 = estimate_beta_ridge(x, daily_minutes_good, alpha=0.1)
# [30.8, 0.95, -1.84, 0.54]
dot(beta_0_1[1:], beta_0_1[1:])  # 4.60
multiple_r_squared(x, daily_minutes_good, beta_0_1)  # 0.680
 
beta_1 = estimate_beta_ridge(x, daily_minutes_good, alpha=1)
# [30.7, 0.90, -1.69, 0.085]
dot(beta_1[1:], beta_1[1:])  # 3.69
multiple_r_squared(x, daily_minutes_good, beta_1)  # 0.676
 
beta_10 = estimate_beta_ridge(x, daily_minutes_good, alpha=10)
# [28.3, 0.72, -0.91, -0.017]
dot(beta_10[1:], beta_10[1:])  # 1.36
multiple_r_squared(x, daily_minutes_good, beta_10)  # 0.573

In particular, the coefficient on “PhD” vanishes as we increase the penalty, which accords with our previous result that it wasn’t significantly different from zero.

NOTE

Usually you’d want to rescale your data before using this approach. After all, if you changed years of experience to centuries of experience, its least squares coefficient would increase by a factor of 100 and suddenly get penalized much more, even though it’s the same model.

Another approach is lasso regression, which uses the penalty:

def lasso_penalty(beta, alpha):
    return alpha * sum(abs(beta_i) for beta_i in beta[1:])

Whereas the ridge penalty shrank the coefficients overall, the lasso penalty tends to force coefficients to be zero, which makes it good for learning sparse models. Unfortunately, it’s not amenable to gradient descent, which means that we won’t be able to solve it from scratch.

For Further Exploration

§  Regression has a rich and expansive theory behind it. This is another place where you should consider reading a textbook or at least a lot of Wikipedia articles.

§  scikit-learn has a linear_model module that provides a LinearRegression model similar to ours, as well as Ridge regression, Lasso regression, and other types of regularization too.

§  Statsmodels is another Python module that contains (among other things) linear regression models.