
Linear Regression

Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables.

In the sample equation y = 5 + 4x:

  • x is the predictor (the independent variable)

  • y is the predicted value (the dependent variable)

  • 5 is the constant (the intercept)

  • 4 is the coefficient (the slope) that multiplies the predictor x
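
As a quick illustration, the sample equation can be evaluated for a few inputs. This is a minimal sketch; the function name `predict` and the input values are illustrative choices, not part of the original example.

```python
def predict(x, intercept=5, slope=4):
    """Return y = intercept + slope * x for the sample equation y = 5 + 4x."""
    return intercept + slope * x


# Evaluate the sample equation for a few illustrative predictor values.
for x_value in [0, 1, 2.5]:
    print(x_value, predict(x_value))
```

Increasing x by 1 raises the predicted y by the coefficient 4, while the constant 5 is the prediction at x = 0.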

Assumptions for linear regression include:

  • there is a linear relationship between the dependent variable and the independent variables (regressors)

  • the residuals (errors) are normally distributed and independent of each other

  • there is minimal multicollinearity between the independent variables

  • the variance around the regression line is the same for all values of the independent (predictor) variable (homoscedasticity)
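
One informal way to screen for multicollinearity is to inspect the pairwise correlations between predictors. The sketch below uses made-up data and hypothetical names (`x1`, `x2`, `x3`); a correlation close to +1 or -1 between two predictors suggests that they carry overlapping information.

```python
import numpy as np

# Hypothetical predictors: x2 is nearly a multiple of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = np.arange(50, dtype=float)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=50)
x3 = rng.normal(size=50)

# Each row passed to corrcoef is treated as one variable.
correlations = np.corrcoef([x1, x2, x3])

# Entry [i, j] is the correlation between predictor i and predictor j.
print(np.round(correlations, 2))
```

Here the entry for `x1` and `x2` is close to 1, which would be a reason to drop or combine one of them before fitting. This is only a rough screen, not a full diagnostic.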

Mathematical Model

Linear regression fits a linear function to a set of data points using the least squares approach, which chooses the coefficients that minimize the sum of squared differences between the observed and predicted values.
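
For a single predictor, the standard least-squares quantities can be written as follows. This is a reconstruction based on the NumPy example in this section, reusing its names b_0 and b_1; n is the number of observations, and the symbols SS_xy and SS_xx are labels assumed here for the cross-deviation and the deviation of x.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \\
SS_{xy} &= \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y} \\
SS_{xx} &= \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2 \\
b_1 &= \frac{SS_{xy}}{SS_{xx}}, \qquad b_0 = \bar{y} - b_1\,\bar{x} \\
\hat{y}_i &= b_0 + b_1 x_i
\end{aligned}
```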

Python Example

"""
linear_regression_with_numpy.py
uses numpy built-in functions to perform linear regression
"""

# Import needed libraries.
import numpy as np
import matplotlib.pyplot as plotlib

# Define an input vector x (one-dimensional array) of independent variable values.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Define an input vector y (one-dimensional array) of dependent variable values.
y = np.array([2, 5, 1, 6, 8, 10, 9, 8, 11, 13])

# Get the number of observation points.
number_of_observations = np.size(x)

# Calculate mean values of the x and y vectors.
mean_x = np.mean(x)
mean_y = np.mean(y)

# Calculate the cross-deviation between the x and y vectors.
x_y_deviation = np.sum(y * x) - (number_of_observations * mean_y * mean_x)

# Calculate the deviation of the x vector.
x_deviation = np.sum(x * x) - (number_of_observations * mean_x * mean_x)

# Calculate least-squares regression coefficients.
b_1 = x_y_deviation / x_deviation
b_0 = mean_y - (b_1 * mean_x)

# Print the regression coefficients as (intercept b_0, slope b_1).
print(f"Regression Coefficients: ({b_0}, {b_1})")

# Plot the x,y data points using the x and y vectors.
plotlib.scatter(x, y, color="r", marker="o", s=30)

# Calculate a y values vector for the regression line.
y_values_for_regression_line = b_0 + (b_1 * x)

# Plot the regression line using the vectors x and y_values_for_regression_line.
plotlib.plot(x, y_values_for_regression_line, color="b")

# Set the graph axis labels.
plotlib.xlabel('x')
plotlib.ylabel('y')

# Display the graph.
plotlib.show()

Results are shown below:

Regression Coefficients: (2.2, 1.1333333333333333)
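
As a cross-check, NumPy's `np.polyfit` can fit the same line directly. This sketch is an addition rather than part of the original script, and it repeats the data so that it runs on its own. Note that `np.polyfit` returns coefficients from the highest degree down, so the slope comes first.

```python
import numpy as np

# Same data as in the example above.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([2, 5, 1, 6, 8, 10, 9, 8, 11, 13])

# Fit a degree-1 polynomial; the result is ordered [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Both values should agree with b_1 and b_0 up to floating-point rounding.
print(intercept, slope)
```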

Python Example using scikit-learn

"""
linear_regression_with_scikit_learn.py
uses scikit-learn built-in functions to perform linear regression
"""

# Import dataset and code libraries.
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plotlib
import numpy as np

# Load the diabetes dataset.
diabetes = datasets.load_diabetes()

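# Use a single feature (column index 2, body mass index), kept as a
# two-dimensional column array because scikit-learn expects 2-D input.
diabetes_X = diabetes.data[:, np.newaxis, 2]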

# Split the data into training and testing sets, holding out the last 20 samples for testing.
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression model object.
regression_model = linear_model.LinearRegression()

# Train the model.
regression_model.fit(diabetes_X_train, diabetes_y_train)

# Predict target values for the test set.
diabetes_y_pred = regression_model.predict(diabetes_X_test)

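# Print the fitted coefficient, the mean squared error, and R^2 on the test set.
print('Coefficients: \n', regression_model.coef_)
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# r2_score returns the coefficient of determination (1.0 is a perfect prediction).
print('Coefficient of determination (R^2): %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))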

# Plot results.
plotlib.scatter(diabetes_X_test, diabetes_y_test, color='red')
plotlib.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=2)
plotlib.xlabel('x')
plotlib.ylabel('y')
plotlib.show()


Results are shown below: