Linear regression is a fundamental concept in machine learning and data analysis. It is a statistical technique that models the relationship between a dependent variable and one or more independent variables. The scikit-learn (sklearn) library in Python provides a powerful tool, the linear_model.LinearRegression class, for implementing linear regression models. In this article, we will explore how to perform linear regression using sklearn and analyze the results.
What is Linear Regression?
Linear regression is a simple yet effective technique used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and attempts to draw a straight line that best fits the data points. The goal of linear regression is to minimize the residual sum of squares, which represents the difference between the observed responses in the dataset and the responses predicted by the linear approximation.
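To make the idea concrete, here is a minimal sketch, using small made-up data points, that computes the closed-form least-squares slope and intercept for a single feature, along with the residual sum of squares being minimized:

```python
import numpy as np

# Made-up (x, y) data points for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates for slope and intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Residual sum of squares for the fitted line: sum of (observed - predicted)^2
rss = np.sum((y - (slope * x + intercept)) ** 2)
```

sklearn's LinearRegression performs this same minimization internally, generalized to any number of features.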
Importing the Required Libraries
Before we dive into the implementation of linear regression using sklearn, let's start by importing the necessary libraries.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
Here, we import matplotlib.pyplot for plotting the results, numpy for numerical computations, the datasets and linear_model modules from sklearn for loading the data and creating the model, and mean_squared_error and r2_score for evaluating the model's performance.
Loading the Dataset
Next, we load the diabetes dataset provided by sklearn using the datasets.load_diabetes() function. The diabetes dataset is a widely used dataset for regression analysis, consisting of several features related to diabetes progression.
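As a quick sanity check (not part of the original walkthrough), we can inspect the shapes of the arrays that load_diabetes() returns:

```python
from sklearn import datasets

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# 442 samples, each with 10 physiological features
print(diabetes_X.shape)  # (442, 10)
print(diabetes_y.shape)  # (442,)
```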
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
In this example, we will use only one feature from the diabetes dataset so that we can illustrate the data points in a two-dimensional plot. We select the third feature, keeping the two-dimensional column shape that sklearn expects, by indexing:
diabetes_X = diabetes_X[:, np.newaxis, 2]
Splitting the Data
To evaluate the performance of our linear regression model, we need to split the dataset into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
In this example, we split the dataset such that the last 20 samples are used as the testing set, and the remaining samples are used as the training set.
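The manual slicing above keeps the classic example reproducible. As an alternative sketch, scikit-learn's model_selection.train_test_split can perform a randomized split instead, here with a fixed random_state for reproducibility:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Randomly hold out 20 samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=20, random_state=0
)
```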
Creating the Linear Regression Model
Now that we have our data ready, we can create an instance of the linear regression model using the linear_model.LinearRegression() constructor.
regr = linear_model.LinearRegression()
This creates a linear regression object called regr.
Training the Model
To train our linear regression model, we call the fit() method and pass in the training data.
regr.fit(diabetes_X_train, diabetes_y_train)
The fit() method fits the model to the training data, adjusting the model's coefficients to minimize the residual sum of squares.
Making Predictions
Once we have trained our linear regression model, we can use it to make predictions on the testing set. We call the predict() method and pass in the testing data.
diabetes_y_pred = regr.predict(diabetes_X_test)
The predict() method returns the predicted responses for the testing set based on the trained model.
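As a side note, after fitting, the learned parameters are exposed as attributes of the model. A self-contained sketch repeating the steps so far:

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]

regr = linear_model.LinearRegression()
regr.fit(diabetes_X[:-20], diabetes_y[:-20])

# The fitted slope(s) and intercept of the regression line
print("Coefficients:", regr.coef_)
print("Intercept:", regr.intercept_)
```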
Evaluating the Model
Now that we have our predicted responses, we can evaluate the performance of our linear regression model. We calculate the mean squared error and the coefficient of determination.
The mean squared error measures the average squared difference between the actual and predicted responses. A lower mean squared error indicates a better fit. The coefficient of determination, also known as R-squared, measures the proportion of the variance in the dependent variable that can be explained by the independent variable. A value of 1 indicates a perfect prediction.
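Putting the pieces together, the two metrics can be computed with the mean_squared_error and r2_score functions imported earlier. A self-contained sketch of the full pipeline:

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]

regr = linear_model.LinearRegression()
regr.fit(diabetes_X[:-20], diabetes_y[:-20])
diabetes_y_pred = regr.predict(diabetes_X[-20:])

# Mean squared error: average squared difference between actual and predicted
mse = mean_squared_error(diabetes_y[-20:], diabetes_y_pred)
# Coefficient of determination: 1.0 would be a perfect prediction
r2 = r2_score(diabetes_y[-20:], diabetes_y_pred)

print("Mean squared error: %.2f" % mse)
print("Coefficient of determination: %.2f" % r2)
```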
Visualizing the Results
To better understand the performance of our linear regression model, we can plot the predicted responses against the actual responses using matplotlib.
plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.show()
The scatter plot shows the actual responses as black dots, while the blue line represents the predicted responses. The closer the points align to the line, the better the model’s prediction.
Conclusion
In this article, we learned how to perform linear regression using sklearn in Python. We loaded the diabetes dataset, split it into training and testing sets, created a linear regression model, trained the model, made predictions, and evaluated its performance. We also visualized the results with a scatter plot. Linear regression is a powerful technique for modeling the relationship between variables and can be applied to a wide range of real-world problems. By using sklearn, we can easily implement and analyze linear regression models in Python.