Linear Regression
Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables.
In the sample equation y = 5 + 4x:
x is the predictor (independent variable)
y is the predicted value (dependent variable)
5 is the constant (intercept)
4 is the coefficient (slope) that multiplies the predictor x
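As a minimal sketch, the sample equation can be evaluated directly in Python. The function name predict and the input values are illustrative assumptions, not part of the original example.

```python
def predict(x, constant=5, coefficient=4):
    """Return y for the sample equation y = 5 + 4x."""
    return constant + coefficient * x

# Evaluate the equation for a few illustrative predictor values.
predictions = [predict(x) for x in (0, 1, 2)]
print(predictions)  # [5, 9, 13]
```

Each increase of 1 in x raises the predicted y by the coefficient, 4.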
Assumptions of linear regression include:
there is a linear relationship between the dependent variable and the independent variables (regressors)
the residuals (errors) are normally distributed and independent of each other
there is minimal multicollinearity between the independent variables
the variance around the regression line is the same for all values of the independent (predictor) variable
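Some of these assumptions can be inspected informally with NumPy. The sketch below is a rough check under stated assumptions, not a formal statistical test: it reuses the data values from the NumPy example in this article, and the second predictor x2 is a hypothetical array added only to illustrate a multicollinearity check.

```python
import numpy as np

# Sample data (same values as the NumPy example in this article).
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([2, 5, 1, 6, 8, 10, 9, 8, 11, 13])

# Fit a straight line; for degree 1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Residuals are the observed values minus the fitted values.
residuals = y - (intercept + slope * x)

# Rough constant-variance check: compare the residual spread in the lower
# and upper halves of x. Similar values are consistent with the assumption.
spread_low = np.std(residuals[:5])
spread_high = np.std(residuals[5:])

# Rough multicollinearity check: correlation between two predictors.
# x2 is a hypothetical second predictor used only for illustration.
x2 = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
predictor_correlation = np.corrcoef(x, x2)[0, 1]

print("Residual mean:", residuals.mean())
print("Residual spread (low x, high x):", spread_low, spread_high)
print("Correlation between predictors:", predictor_correlation)
```

A correlation between predictors close to 1 or -1 would suggest multicollinearity. A histogram or plot of the residuals against x is a common next step for judging normality and constant variance.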
Mathematical Model
Linear regression fits a linear function to a set of data points using a least-squares approach: it chooses the coefficients that minimize the sum of the squared differences between the observed values and the values predicted by the line.
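For a single predictor, the least-squares quantities can be written as follows. The notation is an assumption chosen to mirror the variable names used in the NumPy example in this article (b_0, b_1, mean_x, mean_y, number_of_observations); S_xy and S_xx correspond to x_y_deviation and x_deviation.

```latex
\begin{aligned}
\hat{y}_i &= b_0 + b_1 x_i \\
S_{xy} &= \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y} \\
S_{xx} &= \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2 \\
b_1 &= \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1\,\bar{x}
\end{aligned}
```

Here n is the number of observations, and \bar{x} and \bar{y} are the means of the x and y values.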
Python Example
""" linear_regression_with_numpy.py uses numpy built-in functions to perform linear regression """ # Import needed libraries. import numpy as np import matplotlib.pyplot as plotlib # Define an input vector x (one dimensional array) for the independent variables. x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) # Define an input vector y (one dimensional array) for the dependent variables. y = np.array([2, 5, 1, 6, 8, 10, 9, 8, 11, 13]) # Get the number of observation points. number_of_observations = np.size(x) # Calculate mean values of the x and y vectors. mean_x = np.mean(x) mean_y = np.mean(y) # Calculate the cross-deviation between the x and y vectors. x_y_deviation = np.sum(y * x) - (number_of_observations * mean_y * mean_x) # Calculate the deviation of the x vector. x_deviation = np.sum(x * x) - (number_of_observations * mean_x * mean_x) # Calculate least-squares regression coefficients. b_1 = x_y_deviation / x_deviation b_0 = mean_y - (b_1 * mean_x) # Print the regression coefficients. print("Regression Coefficients: " + str(b_0) + ', ' + str(b_1)) # Plot the x,y data points using the x and y vectors. plotlib.scatter(x, y, color="r", marker="o", s=30) # Calculate a y values vector for the regression line. y_values_for_regression_line = b_0 + (b_1 * x) # Plot the regression line using the vectors x and y_values_for_regression_line. plotlib.plot(x, y_values_for_regression_line, color="b") # Set the graph axis labels. plotlib.xlabel('x') plotlib.ylabel('y') # Display the graph. plotlib.show()
Results are shown below:
Regression Coefficients: (2.2, 1.1333333333333333)
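One way to cross-check these coefficients is NumPy's np.polyfit function, which performs a least-squares polynomial fit. This is an alternative sketch, not the method used in the script above; the variable names slope and intercept are illustrative.

```python
import numpy as np

# Same data as the example above.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([2, 5, 1, 6, 8, 10, 9, 8, 11, 13])

# For degree 1, np.polyfit returns the coefficients highest power first:
# [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

print("Intercept:", round(intercept, 4))
print("Slope:", round(slope, 4))
```

The intercept corresponds to b_0 and the slope to b_1 in the script above.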
Python Example using scikit-learn
"""
linear_regression_with_scikit_learn.py
uses scikit-learn built-in functions to perform linear regression
"""
# Import dataset and code libraries.
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plotlib
import numpy as np
# Load the diabetes dataset.
diabetes = datasets.load_diabetes()
# Use a single feature (column index 2, body mass index), kept as a 2-D column array.
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the feature (X) and target (y) data into training and testing sets,
# holding out the last 20 samples for testing.
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression model object.
regression_model = linear_model.LinearRegression()
# Train the model.
regression_model.fit(diabetes_X_train, diabetes_y_train)
# Predict target values for the test data.
diabetes_y_pred = regression_model.predict(diabetes_X_test)
# Print results. r2_score returns the coefficient of determination (R^2).
print('Coefficients: \n', regression_model.coef_)
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot results.
plotlib.scatter(diabetes_X_test, diabetes_y_test, color='red')
plotlib.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=2)
plotlib.xlabel('x')
plotlib.ylabel('y')
plotlib.show()
Results are shown below:
Coefficients: [938.23786125]
Mean squared error: 2548.07
Variance score: 0.47
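The two metrics printed by the script can also be computed by hand, which shows what they measure. The sketch below uses small hypothetical arrays, not the diabetes data, and the formulas are the standard definitions of mean squared error and the coefficient of determination.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Mean squared error: average of the squared prediction errors.
mse = np.mean((y_true - y_pred) ** 2)

# Coefficient of determination (R^2): 1 minus the ratio of the residual
# sum of squares to the total sum of squares around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot

print("Mean squared error:", mse)
print("R^2:", r2)
```

An R^2 of 1 means the predictions match the observations exactly, while a value near 0 means the model does little better than always predicting the mean.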