Data Science from Scratch: First Principles with Python (2015)

Chapter 16. Logistic Regression

A lot of people say there’s a fine line between genius and insanity. I don’t think there’s a fine line, I actually think there’s a yawning gulf.

Bill Bailey

In Chapter 1, we briefly looked at the problem of trying to predict which DataSciencester users paid for premium accounts. Here we’ll revisit that problem.

The Problem

We have an anonymized data set of about 200 users, containing each user’s salary, her years of experience as a data scientist, and whether she paid for a premium account (Figure 16-1). As is usual with categorical variables, we represent the dependent variable as either 0 (no premium account) or 1 (premium account).

As usual, our data is in a matrix where each row is a list [experience, salary, paid_account]. Let’s turn it into the format we need:

x = [[1] + row[:2] for row in data]  # each element is [1, experience, salary]
y = [row[2] for row in data]         # each element is paid_account

An obvious first attempt is to use linear regression and find the best model:

Paid and Unpaid Users.

 

Figure 16-1. Paid and unpaid users

And certainly, there’s nothing preventing us from modeling the problem this way. The results are shown in Figure 16-2:

rescaled_x = rescale(x)
beta = estimate_beta(rescaled_x, y)  # [0.26, 0.43, -0.43]
predictions = [predict(x_i, beta) for x_i in rescaled_x]
 
plt.scatter(predictions, y)
plt.xlabel("predicted")
plt.ylabel("actual")
plt.show()

Using Linear Regression to Predict Premium Accounts.

Figure 16-2. Using linear regression to predict premium accounts

But this approach leads to a couple of immediate problems:

§  We’d like for our predicted outputs to be 0 or 1, to indicate class membership. It’s fine if they’re between 0 and 1, since we can interpret these as probabilities — an output of 0.25 could mean 25% chance of being a paid member. But the outputs of the linear model can be huge positive numbers or even negative numbers, which it’s not clear how to interpret. Indeed, here a lot of our predictions were negative.

§  The linear regression model assumed that the errors were uncorrelated with the columns of x. But here, the regression coefficent for experience is 0.43, indicating that more experience leads to a greater likelihood of a premium account. This means that our model outputs very large values for people with lots of experience. But we know that the actual values must be at most 1, which means that necessarily very large outputs (and therefore very large values of experience) correspond to very large negative values of the error term. Because this is the case, our estimate ofbeta is biased.

What we’d like instead is for large positive values of dot(x_i, beta) to correspond to probabilities close to 1, and for large negative values to correspond to probabilities close to 0. We can accomplish this by applying another function to the result.

The Logistic Function

In the case of logistic regression, we use the logistic function, pictured in Figure 16-3:

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

Logistic function.

Figure 16-3. The logistic function

As its input gets large and positive, it gets closer and closer to 1. As its input gets large and negative, it gets closer and closer to 0. Additionally, it has the convenient property that its derivative is given by:

def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))

which we’ll make use of in a bit. We’ll use this to fit a model:

where f is the logistic function.

Recall that for linear regression we fit the model by minimizing the sum of squared errors, which ended up choosing the  that maximized the likelihood of the data.

Here the two aren’t equivalent, so we’ll use gradient descent to maximize the likelihood directly. This means we need to calculate the likelihood function and its gradient.

Given some , our model says that each  should equal 1 with probability  and 0 with probability .

In particular, the pdf for  can be written as:

since if  is 0, this equals:

and if  is 1, it equals:

It turns out that it’s actually simpler to maximize the log likelihood:

Because log is strictly increasing function, any beta that maximizes the log likelihood also maximizes the likelihood, and vice versa.

def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))

If we assume different data points are independent from one another, the overall likelihood is just the product of the individual likelihoods. Which means the overall log likelihood is the sum of the individual log likelihoods:

def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta)
               for x_i, y_i in zip(x, y))

A little bit of calculus gives us the gradient:

def logistic_log_partial_ij(x_i, y_i, beta, j):
    """here i is the index of the data point,
    j the index of the derivative"""
 
    return (y_i - logistic(dot(x_i, beta))) * x_i[j]
 
def logistic_log_gradient_i(x_i, y_i, beta):
    """the gradient of the log likelihood
    corresponding to the ith data point"""
 
    return [logistic_log_partial_ij(x_i, y_i, beta, j)
            for j, _ in enumerate(beta)]
 
def logistic_log_gradient(x, y, beta):
    return reduce(vector_add,
                  [logistic_log_gradient_i(x_i, y_i, beta)
                   for x_i, y_i in zip(x,y)])

at which point we have all the pieces we need.

Applying the Model

We’ll want to split our data into a training set and a test set:

random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(rescaled_x, y, 0.33)
 
# want to maximize log likelihood on the training data
fn = partial(logistic_log_likelihood, x_train, y_train)
gradient_fn = partial(logistic_log_gradient, x_train, y_train)
 
# pick a random starting point
beta_0 = [random.random() for _ in range(3)]
 
# and maximize using gradient descent
beta_hat = maximize_batch(fn, gradient_fn, beta_0)

Alternatively, you could use stochastic gradient descent:

beta_hat = maximize_stochastic(logistic_log_likelihood_i,
                               logistic_log_gradient_i,
                               x_train, y_train, beta_0)

Either way we find approximately:

beta_hat = [-1.90, 4.05, -3.87]

These are coefficients for the rescaled data, but we can transform them back to the original data as well:

beta_hat_unscaled = [7.61, 1.42, -0.000249]

Unfortunately, these are not as easy to interpret as linear regression coefficients. All else being equal, an extra year of experience adds 1.42 to the input of logistic. All else being equal, an extra $10,000 of salary subtracts 2.49 from the input of logistic.

The impact on the output, however, depends on the other inputs as well. If dot(beta, x_i) is already large (corresponding to a probability close to 1), increasing it even by a lot cannot affect the probability very much. If it’s close to 0, increasing it just a little might increase the probability quite a bit.

What we can say is that — all else being equal — people with more experience are more likely to pay for accounts. And that — all else being equal — people with higher salaries are less likely to pay for accounts. (This was also somewhat apparent when we plotted the data.)

Goodness of Fit

We haven’t yet used the test data that we held out. Let’s see what happens if we predict paid account whenever the probability exceeds 0.5:

true_positives = false_positives = true_negatives = false_negatives = 0
 
for x_i, y_i in zip(x_test, y_test):
    predict = logistic(dot(beta_hat, x_i))
 
    if y_i == 1 and predict >= 0.5:  # TP: paid and we predict paid
        true_positives += 1
    elif y_i == 1:                   # FN: paid and we predict unpaid
        false_negatives += 1
    elif predict >= 0.5:             # FP: unpaid and we predict paid
        false_positives += 1
    else:                            # TN: unpaid and we predict unpaid
        true_negatives += 1
 
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

This gives a precision of 93% (“when we predict paid account we’re right 93% of the time”) and a recall of 82% (“when a user has a paid account we predict paid account 82% of the time”), both of which are pretty respectable numbers.

We can also plot the predictions versus the actuals (Figure 16-4), which also shows that the model performs well:

predictions = [logistic(dot(beta_hat, x_i)) for x_i in x_test]
plt.scatter(predictions, y_test)
plt.xlabel("predicted probability")
plt.ylabel("actual outcome")
plt.title("Logistic Regression Predicted vs. Actual")
plt.show()

Logistic Regression Predicted vs Actual.

Figure 16-4. Logistic regression predicted versus actual

 

Support Vector Machines

The set of points where dot(beta_hat, x_i) equals 0 is the boundary between our classes. We can plot this to see exactly what our model is doing (Figure 16-5).

This boundary is a hyperplane that splits the parameter space into two half-spaces corresponding to predict paid and predict unpaid. We found it as a side-effect of finding the most likely logistic model.

An alternative approach to classification is to just look for the hyperplane that “best” separates the classes in the training data. This is the idea behind the support vector machine, which finds the hyperplane that maximizes the distance to the nearest point in each class (Figure 16-6).

Paid and Unpaid Users With Decision Boundary.

Figure 16-5. Paid and unpaid users with decision boundary

Finding such a hyperplane is an optimization problem that involves techniques that are too advanced for us. A different problem is that a separating hyperplane might not exist at all. In our “who pays?” data set there simply is no line that perfectly separates the paid users from the unpaid users.

We can (sometimes) get around this by transforming the data into a higher-dimensional space. For example, consider the simple one-dimensional data set shown in Figure 16-7.

A Separating Hyperplane

Figure 16-6. A separating hyperplane

It’s clear that there’s no hyperplane that separates the positive examples from the negative ones. However, look at what happens when we map this data set to two dimensions by sending the point x to (x, x**2). Suddenly it’s possible to find a hyperplane that splits the data (Figure 16-8).

This is usually called the kernel trick because rather than actually mapping the points into the higher-dimensional space (which could be expensive if there are a lot of points and the mapping is complicated), we can use a “kernel” function to compute dot products in the higher-dimensional space and use those to find a hyperplane.

A Non-Separable One-Dimensional Data set

Figure 16-7. A nonseparable one-dimensional data set

It’s hard (and probably not a good idea) to use support vector machines without relying on specialized optimization software written by people with the appropriate expertise, so we’ll have to leave our treatment here.

Becomes Separable In Higher Dimensions

Figure 16-8. Data set becomes separable in higher dimensions

 

For Further Investigation

§  scikit-learn has modules for both Logistic Regression and Support Vector Machines.

§  libsvm is the support vector machine implementation that scikit-learn is using behind the scenes. Its website has a variety of useful documentation about support vector machines.