Mastering Machine Learning with scikit-learn (Second Edition)
Gavin Hackeling
Solving OLS for simple linear regression
In this section, we will work through solving OLS for simple linear regression. Recall that simple linear regression is given by the equation $y = \alpha + \beta x$, and that our goal is to solve for the values of $\beta$ and $\alpha$ that minimize the cost function. We will solve for $\beta$ first. To do so, we will calculate the variance of $x$ and the covariance of $x$ and $y$. Variance is a measure of how far a set of values is spread out. If all the numbers in the set are equal, the variance of the set is zero. A small variance indicates that the numbers are near the mean of the set, while a set containing numbers that are far from the mean and from each other will have a large variance. Variance can be calculated using the following equation:
$$\mathrm{var}(x) = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n - 1}$$
Here, $\bar{x}$ is the mean of $x$, $x_i$ is the value of $x$ for the $i$th training instance, and $n$ is the number of training instances. Let's calculate the variance of the pizza diameters in our training set:
# In[2]:
import numpy as np
X = np.array([[6], [8], [10], [14], [18]]).reshape(-1, 1)
x_bar = X.mean()
print(x_bar)
# Note that we subtract one from the number of training instances when
# calculating the sample variance.
# This technique is called Bessel's correction. It corrects the bias in the estimation of the population variance
# from a sample.
variance = ((X - x_bar)**2).sum() / (X.shape[0] - 1)
print(variance)
# Out[2]:
11.2
23.2
NumPy also provides the var function for calculating variance. The keyword parameter ddof can be used to apply Bessel's correction and calculate the sample variance:
# In[3]:
print(np.var(X, ddof=1))
# Out[3]:
23.2
Covariance is a measure of how much two variables change together. If the variables increase together, their covariance is positive. If one variable tends to increase while the other decreases, their covariance is negative. If there is no linear relationship between the two variables, their covariance will be equal to zero; they are linearly uncorrelated but not necessarily independent. Covariance can be calculated using the following formula:
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{n - 1}$$
As with variance, $x_i$ is the diameter of the $i$th training instance, $\bar{x}$ is the mean of the diameters, $\bar{y}$ is the mean of the prices, $y_i$ is the price of the $i$th training instance, and $n$ is the number of training instances. Let's calculate the covariance of the diameters and prices of the pizzas in the training set:
# In[4]:
# We previously used a list to represent y.
# Here we switch to a NumPy ndarray, which provides a method to calculate the sample mean.
y = np.array([7, 9, 13, 17.5, 18])
y_bar = y.mean()
# We transpose X because both operands must be row vectors
covariance = np.multiply((X - x_bar).transpose(), y - y_bar).sum() / (X.shape[0] - 1)
print(covariance)
print(np.cov(X.transpose(), y)[0][1])
# Out[4]:
22.65
22.65
Now that we have calculated the variance of our explanatory variable and the covariance of the response and explanatory variables, we can solve for β using the following:
$$\beta = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \frac{22.65}{23.2} \approx 0.9763$$
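To confirm this in code, here is a minimal sketch that continues the interpreter session above, reusing the variance and covariance variables we computed earlier (beta is simply our name for the estimate):
# In[5]:
beta = covariance / variance
print(round(beta, 4))
# Out[5]:
0.9763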
Having solved for β, we can solve for α using this formula:
$$\alpha = \bar{y} - \beta\bar{x}$$
Here, $\bar{y}$ is the mean of $y$ and $\bar{x}$ is the mean of $x$; $(\bar{x}, \bar{y})$ are the coordinates of the centroid, a point that the model must pass through. Substituting our values gives $\alpha = 12.9 - 0.9763 \times 11.2 \approx 1.9655$.
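Continuing the session, we can compute the intercept the same way. This sketch assumes the x_bar, y_bar, and beta variables from the previous snippets (alpha is our name for the estimate):
# In[6]:
alpha = y_bar - beta * x_bar
print(round(alpha, 4))
# Out[6]:
1.9655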

Now that we have solved for the values of the model's parameters that minimize the cost function, we can plug in the diameters of the pizzas and predict their prices. For instance, an 11" pizza should be expected to cost about $12.70, and an 18" pizza should be expected to cost about $19.54. Congratulations! You used simple linear regression to predict the price of a pizza.
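For reference, these predictions can be reproduced in the same session. The sketch below assumes the alpha and beta variables computed above; predict is just an illustrative helper name:
# In[7]:
def predict(x):
    # Apply the fitted model: price = alpha + beta * diameter
    return alpha + beta * x

print(round(predict(11), 2))
print(round(predict(18), 2))
# Out[7]:
12.7
19.54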