What Is Linear Regression and How to Use It with scikit-learn

Introduction

Linear regression is a fundamental technique in machine learning and data analysis. It is a statistical method that models the relationship between a dependent variable and one or more independent variables. The scikit-learn library in Python (imported as sklearn) provides a ready-made implementation of linear regression. In this article, we will explore how to perform linear regression with scikit-learn and analyze the results.

What is Linear Regression?

Linear regression is a simple yet effective technique used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and fits a straight line that best matches the data points. The goal of linear regression is to minimize the residual sum of squares, which is the sum of the squared differences between the observed responses in the dataset and the responses predicted by the linear approximation.
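
To make that concrete, here is a small, self-contained toy example (with made-up numbers, not the diabetes data used below) that computes the residual sum of squares for a candidate line using numpy:

import numpy as np

# Toy data: one independent variable x and observed responses y
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.3, 6.2, 7.9])

# A candidate straight line: y_hat = slope * x + intercept
slope, intercept = 2.0, 0.1
y_hat = slope * x + intercept

# Residual sum of squares: the quantity linear regression minimizes
rss = np.sum((y - y_hat) ** 2)
print(rss)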

Importing the Required Libraries

Before we dive into the implementation of linear regression using sklearn, let’s start by importing the necessary libraries.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

Here, we import matplotlib.pyplot for plotting the results, numpy for numerical computations, datasets and linear_model from sklearn for loading the example data and creating the linear regression model, and mean_squared_error and r2_score for evaluating the model’s performance.

Loading the Dataset

Next, we load the diabetes dataset provided by sklearn using the datasets.load_diabetes() function. The diabetes dataset is widely used for regression analysis: it contains ten baseline features (age, sex, body mass index, blood pressure, and six blood serum measurements) and, as the target, a quantitative measure of disease progression one year after baseline.

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

In this example, we will use only one feature from the diabetes dataset so that the data points can be shown in a two-dimensional plot. We select the third feature, keeping the array two-dimensional, by indexing diabetes_X as diabetes_X[:, np.newaxis, 2], as shown below.
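
Written out, that selection is the single line below; np.newaxis keeps the result two-dimensional with shape (n_samples, 1), which is the shape scikit-learn expects for a feature matrix.

# Use only the third feature, keeping a 2-D shape of (n_samples, 1)
diabetes_X = diabetes_X[:, np.newaxis, 2]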

Splitting the Data

To evaluate the performance of our linear regression model, we need to split the dataset into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.

diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

In this example, we split the dataset such that the last 20 samples are used as the testing set, and the remaining samples are used as the training set.
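
The manual slicing above mirrors the official scikit-learn example. For a randomized split, scikit-learn also provides train_test_split; here is a minimal sketch, with argument values chosen purely for illustration:

from sklearn.model_selection import train_test_split

# Randomly hold out 20 samples for testing; random_state makes the split reproducible
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes_X, diabetes_y, test_size=20, random_state=0
)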

Creating the Linear Regression Model

Now that we have our data ready, we can create an instance of the linear regression model using the linear_model.LinearRegression() class.

regr = linear_model.LinearRegression()

This creates a linear regression object called regr.
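
LinearRegression has only a handful of options. For instance, fit_intercept (which defaults to True) controls whether an intercept term is estimated, so the call above is equivalent to:

# Explicitly request an intercept term (this is already the default)
regr = linear_model.LinearRegression(fit_intercept=True)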

Training the Model

To train our linear regression model, we need to call the fit() method and pass in the training data.

regr.fit(diabetes_X_train, diabetes_y_train)

The fit() method fits the model to the training data, adjusting the model’s coefficients to minimize the residual sum of squares.
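
After fitting, the learned parameters are stored on the estimator and can be inspected directly, which is handy for a quick sanity check:

# Slope (one entry per feature) and intercept of the fitted line
print("Coefficients:", regr.coef_)
print("Intercept:", regr.intercept_)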

Making Predictions

Once we have trained our linear regression model, we can use it to make predictions on the testing set. We call the predict() method and pass in the testing data.

diabetes_y_pred = regr.predict(diabetes_X_test)

The predict() method returns the predicted responses for the testing set based on the trained model.
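
A quick way to eyeball the result is to compare a few predictions with the corresponding actual test responses:

# Compare the first five predictions with the actual values
print("Predicted:", diabetes_y_pred[:5])
print("Actual:   ", diabetes_y_test[:5])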

Evaluating the Model

Now that we have our predicted responses, we can evaluate the performance of our linear regression model. We calculate the mean squared error and the coefficient of determination.

print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))

The mean squared error measures the average squared difference between the actual and predicted responses; a lower value indicates a better fit. The coefficient of determination, also known as R-squared, measures the proportion of the variance in the dependent variable that is explained by the independent variable(s). A value of 1 indicates perfect prediction, while a value of 0 means the model does no better than always predicting the mean of the target.
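
For intuition, both metrics can be reproduced in a few lines of numpy; the values should match those returned by the sklearn functions above:

# Mean squared error: average of the squared residuals
mse = np.mean((diabetes_y_test - diabetes_y_pred) ** 2)

# R-squared: 1 minus the ratio of residual variance to total variance
ss_res = np.sum((diabetes_y_test - diabetes_y_pred) ** 2)
ss_tot = np.sum((diabetes_y_test - np.mean(diabetes_y_test)) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)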

Visualizing the Results

To better understand the performance of our linear regression model, we can plot the test data points together with the fitted regression line using matplotlib.pyplot.

plt.scatter(diabetes_X_test, diabetes_y_test, color="black")
plt.plot(diabetes_X_test, diabetes_y_pred, color="blue", linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

The scatter plot shows the actual responses as black dots, while the blue line represents the model’s predictions across the test feature values. The closer the points lie to the line, the better the fit.
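
If you would rather keep the axes readable instead of hiding the ticks, you can drop the plt.xticks(()) and plt.yticks(()) lines and label the axes before calling plt.show(); the label text below is just a suggestion:

plt.xlabel("Feature value")
plt.ylabel("Disease progression")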

Conclusion

In this article, we learned how to perform linear regression using sklearn in Python. We loaded the diabetes dataset, split it into training and testing sets, created a linear regression model, trained the model, made predictions, and evaluated its performance. We also visualized the results using a scatter plot. Linear regression is a powerful technique for modeling the relationship between variables and can be applied to a wide range of real-world problems. By using sklearn, we can easily implement and analyze linear regression models in Python.
