Linear Regression
- Quick Analysis of Data Set
- Data Wrangling
- Exploratory Data Analysis
- Linear Model Creation
- Model Visualization
- Model Accuracy
- Model Evaluation : Train and Test Data, Cross Validation
Linear Regression is one of the most common and popular algorithms in Machine Learning. In fact, it is typically the first algorithm that you will encounter while learning Machine Learning.
Linear regression, as the name suggests, is a model that assumes a linear relationship between the independent variable (x) and the dependent or target variable (y). There are two types of Linear Regression Model :-
Simple Linear Regression : Linear model between one independent variable (x) and target variable (y).
Multiple Linear Regression : Linear model between two or more independent variables (x1, x2 ...) and target variable (y).
A linear regression model predicts the output (dependent variable) as a function of the independent variable(s).
Y = a + b X
where,
Y : Predicted (dependent) variable
X : Independent variable
a : Intercept of the regression line (i.e. the value of Y when X=0)
b : Slope of the regression line (i.e. the rate of change in Y when X is incremented by 1 unit)
For Multiple Linear Regression, the equation would be
Y = a + b1X1 + b2X2 + b3X3 + ...
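For example, with made-up values a = 2 and b = 3 (purely illustrative, not derived from any data set), the simple equation can be evaluated directly in Python:
# Illustrative only: hypothetical intercept (a) and slope (b)
a, b = 2, 3
X = 4
Y = a + b * X
print(Y)   # 2 + 3*4 = 14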
Let's start with Linear Regression using Python. For this tutorial, I'll be using the Boston Housing Price data set, which is provided in the SciKit Learn library (Link). Details about this data set can be found at Link.
First import all the required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.datasets import load_boston
We will load the dataset using the load_boston method, which returns a Bunch object. A Bunch object is like a Python dictionary; let's quickly look at the significance of its keys :-
data : Data to learn
target : Target column; for this example, the median value of owner-occupied homes will be the target
DESCR : Description of the dataset, including the significance of each header
feature_names : Names of the headers
filename : Physical location of the data file
boston = load_boston()
print("Shape for Data is", boston.data.shape)
print("Shape for Target is", boston.target.shape)
print("Names of header :\n", boston.feature_names)
print("Dataset description :\n", boston.DESCR)
# Create a Pandas Data Frame from this data set
df = pd.DataFrame(boston.data)
df.head(2)
As you can see, the data frame has no header names and is also missing our target column. Let's fix this first :-
df.columns = boston.feature_names
df['MEDV'] = boston.target
df.head(2)
df.info()
If we check our data, it doesn't seem to have any missing values. So, we will treat it as clean data and use it as such.
df.isnull().sum()
# Descriptive Statistics
df.describe()
Let's use some visualization techniques to understand the distribution of our data and the relationship of the independent variables with our target variable.
To see the distribution of the data, we will use histograms. Before proceeding, let's look at what a histogram is :-
A histogram is a type of bar graph that shows the frequency or number of values compared to a set of value ranges. To construct a histogram, the first step is to “bin” the range of values — that is, divide the entire range of values into a series of intervals — and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.
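To make the idea of binning concrete, here is a tiny sketch (on made-up numbers, not our data set) that counts values into three equal-width bins with NumPy:
# Illustrative only: bin ten hypothetical values into three equal-width bins
values = [1, 2, 2, 3, 5, 6, 7, 8, 8, 9]
counts, bin_edges = np.histogram(values, bins=3)
print("Counts per bin :", counts)
print("Bin edges      :", bin_edges)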
Let's use histograms to view the data distribution in our case. Look closely at our target (MEDV) and notice that the prices are mainly distributed in the mid range.
df.hist(bins=30, figsize=(18,18))
To visualize the relationship between the independent variables and the target variable (MEDV), we will use a Regression Plot, which is basically a Scatter Plot with a fitted regression line. We use regression plots for variables with continuous values, while a box plot would be used for categorical variables.
plt.figure(figsize=(25, 25))
for i, col in enumerate(boston.feature_names):
    plt.subplot(4, 4, i+1)
    sns.regplot(x=col, y='MEDV', data=df)
    plt.title("%s vs MEDV" %(col))
Observations so far :-
- Variable 'RM' seems to have a positive linear relationship with the target variable 'MEDV'.
- Variable 'LSTAT' seems to have a negative linear relationship with the target variable 'MEDV'.
Let's use other statistical methods to confirm our observations. We will use the Pearson Correlation as our statistical tool.
What is Pearson Correlation?
A Pearson Correlation is a number between -1 and 1 which indicates the extent to which two variables are linearly related.
Correlation Coefficient:
- 1 : Positively correlated
- -1 : Negatively correlated
- 0 : Not correlated
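As a quick sanity check (on made-up arrays, not our data set), np.corrcoef illustrates these three cases:
# Illustrative arrays only
x = np.array([1, 2, 3, 4, 5])
print(np.corrcoef(x, 2 * x)[0, 1])                      # perfectly positively correlated -> 1.0
print(np.corrcoef(x, -2 * x)[0, 1])                     # perfectly negatively correlated -> -1.0
print(np.corrcoef(x, np.array([4, 1, 3, 5, 2]))[0, 1])  # uncorrelated -> 0.0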
P-value:
The P-value is the probability value that the correlation between these two variables is statistically significant.
By convention:
- p-value < 0.001 : strong evidence that the correlation is significant.
- p-value < 0.05 : moderate evidence that the correlation is significant.
- p-value < 0.1 : weak evidence that the correlation is significant.
- p-value > 0.1 : no evidence that the correlation is significant.
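If you find yourself applying these thresholds repeatedly, they could be wrapped into a small helper (evidence_strength is just a hypothetical name, not a library function):
def evidence_strength(p_value):
    # Map a p-value to the evidence categories listed above
    if p_value < 0.001:
        return "strong evidence"
    elif p_value < 0.05:
        return "moderate evidence"
    elif p_value < 0.1:
        return "weak evidence"
    return "no evidence"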
We will use the Pandas data frame method "corr()" to find the correlation and then use Seaborn's heatmap to plot it.
sns.set(rc={'figure.figsize':(8,8)})
sns.heatmap(df.corr().round(2), square=True, cmap='RdYlGn', annot=True)
We observed visually that attribute 'RM' exhibits a positive linear relationship, which is confirmed by its Correlation Coefficient of 0.7. We also observed that 'LSTAT' exhibits a negative linear relationship with 'MEDV', and this holds statistically as well with a coefficient of -0.74.
Let's also look at the p-values.
pearson_coef1, p_value1 = stats.pearsonr(df['RM'], df['MEDV'])
print("The Pearson Correlation Coefficient for RM is", pearson_coef1, " with a P-value of P = ", p_value1)
pearson_coef2, p_value2 = stats.pearsonr(df['LSTAT'], df['MEDV'])
print("The Pearson Correlation Coefficient for LSTAT is", pearson_coef2, " with a P-value of P = ", p_value2)
print("Is there strong evidence that the correlation is significant for RM :- ", p_value1 < 0.001)
print("Is there strong evidence that the correlation is significant for LSTAT :- ", p_value2 < 0.001)
As you can see, the correlation is significant for variables 'RM' and 'LSTAT'.
Linear Model Creation
Now that we have identified a few variables which appear to have a linear relationship with our target variable, we will start developing the Linear Regression Model.
We will start with Simple Linear Regression. As stated above, Simple Linear Regression uses only one variable, so we will use only the 'RM' attribute.
First, we will import the LinearRegression class from the SciKit Learn library and create a LinearRegression object. Then we will train the model, find the slope and intercept, and finally do some predictions.
NOTE : For this post, as part of explaining the linear regression model, I will be using the entire data set for training. In the real world, we would split the data into test and training data. This is covered in a later section.
from sklearn.linear_model import LinearRegression
# Create Linear Regression Object
lm1 = LinearRegression()
X1 = df[['RM']]
Y1 = df[['MEDV']] # Target
# Fit (Train) the model
lm1.fit(X1,Y1)
print("Intercept for the model is", lm1.intercept_, "and the slope is", lm1.coef_)
# Prediction
Yout1 = lm1.predict(X1)
# Actual and Predicted values (first five)
print("Predicted Values:",Yout1[0:5])
print("Actual Values:",Y1.values[0:5])
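As a quick sanity check, the first prediction can be reproduced by plugging the learned intercept and slope back into Y = a + bX:
# Reproduce the model's first prediction manually: Y = intercept + slope * RM
manual_pred = lm1.intercept_ + lm1.coef_[0] * X1.values[0]
print("Manual prediction for first row:", manual_pred)
print("Model prediction for first row :", Yout1[0])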
Now, we will develop a Multiple Linear Regression model with two variables - 'RM' and 'LSTAT'.
lm2 = LinearRegression()
X2 = df[['RM', 'LSTAT']]
Y2 = df[['MEDV']]
# Fit (Train) the model
lm2.fit(X2,Y2)
print("Intercept for the model is", lm2.intercept_, "and the coefficients are", lm2.coef_)
# Prediction
Yout2 = lm2.predict(X2)
# Actual and Predicted values (first five)
print("Predicted Values:",Yout2[0:5])
print("Actual Values:",Y2.values[0:5])
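We can also use the fitted model to predict the price for a new, hypothetical house; the RM and LSTAT values below are made up purely for illustration:
# Hypothetical house: 6 rooms and LSTAT of 10 (illustrative values only)
new_house = pd.DataFrame({'RM': [6.0], 'LSTAT': [10.0]})
print("Predicted MEDV for the hypothetical house:", lm2.predict(new_house))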
Let's visualize the models we have created. For Simple Linear Regression, we can use either a Regression Plot or a Residual Plot. In the case of a regression plot, we need to focus on the spread of the data around the regression line. If the data is too far from the regression line, then we can say that a Linear Model is not the best fit.
Since we have already used a Regression Plot, we will use a Residual Plot. A residual is the difference between the observed value (Y) and the predicted value (Yout). In a residual plot, the X-axis has the independent variable and the Y-axis has the residuals.
What to look for in a residual plot?
- If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.
# Seaborn library to be used for Residual Plot
plt.figure(figsize=(6,6))
sns.residplot(x=df['RM'], y=df['MEDV'])
plt.show()
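As a cross-check, the same residuals can be computed by hand from our earlier predictions and plotted with matplotlib; this is just a sketch of what residplot does for us:
# Manual residuals: observed MEDV minus the values predicted from RM
residuals = Y1.values.flatten() - Yout1.flatten()
plt.figure(figsize=(6,6))
plt.scatter(df['RM'], residuals)
plt.axhline(y=0, color='r')   # reference line at zero
plt.xlabel('RM')
plt.ylabel('Residuals')
plt.show()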
For Multiple Linear Regression, we can't use a Regression or Residual Plot because there are multiple independent variables, so we will use a Distribution Plot instead.
plt.figure(figsize=(6,6))
ax1 = sns.distplot(df['MEDV'], hist=False, color="r", label="Actual")
sns.distplot(Yout2, hist=False, color="b", label="Fitted", ax=ax1)
Model Accuracy
We will use the following quantitative methods for finding model accuracy.
Mean Squared Error (MSE)
The Mean Squared Error measures the average of the squares of the errors. It is calculated by the following steps (a small manual sketch follows the list) :-
- Finding the error between actual and predicted value
- Taking a square of it
- Sum up all the values
- Divide by number of values
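Those steps map directly onto a few lines of NumPy. Here is a manual sketch on made-up actual and predicted values, just to make the formula concrete:
# Manual MSE on hypothetical values, following the steps above
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 3.0, 8.0])
errors = actual - predicted                        # error between actual and predicted values
squared_errors = errors ** 2                       # square of each error
mse_manual = squared_errors.sum() / len(actual)    # sum up and divide by the number of values
print(mse_manual)                                  # same result as np.mean(squared_errors)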
R-squared
R-Squared, also known as the coefficient of determination, is a measure of how close the data is to the fitted regression line.
R^2 = 1 - (MSE of regression line / MSE of the average of the data)
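Using the same made-up values from the MSE sketch above, R-Square can be computed from this formula as one minus the ratio of the model's MSE to the MSE of simply predicting the average:
# Manual R-Square on the hypothetical values above
mse_model = np.mean((actual - predicted) ** 2)
mse_baseline = np.mean((actual - actual.mean()) ** 2)   # MSE of always predicting the average
r_square_manual = 1 - mse_model / mse_baseline
print(r_square_manual)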
Which one is the better model?
- The model with lower MSE and higher R-Square value.
We can use sklearn.metrics.mean_squared_error to find the Mean Squared Error and the model's score() method to obtain the R-Square value.
from sklearn.metrics import mean_squared_error
# Simple Linear Regression
mse1 = mean_squared_error(Y1,Yout1)
print("Mean square error for simple linear regression is",mse1)
print("R-Square value for simple linear regression is", lm1.score(X1,Y1))
print("\n")
# Multiple Linear Regression
mse2 = mean_squared_error(Y2,Yout2)
print("Mean square error for multiple linear regression is",mse2)
print("R-Square value for multiple linear regression is", lm2.score(X2,Y2))
We can see that the multiple linear regression model seems to perform better because of its higher R-Square value and lower mean squared error.
Model Evaluation : Train and Test Data, Cross Validation
As I mentioned before, in the real world we would split our data into test and training data to evaluate our model. We will use the "train_test_split" method to split our data into testing and training data. In this section, I will be using the Multiple Regression Model to illustrate the process.
# First step that we will take is to separate the target data
y_data = df['MEDV']
x_data = df.drop('MEDV',axis=1)
from sklearn.model_selection import train_test_split
# Split the data into test and training (15% as test data)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=10)
print("Number of test samples :", x_test.shape[0])
print("Number of training samples:",x_train.shape[0])
lm = LinearRegression()
# Fit (Train) the model using the training data
lm.fit(x_train[['RM','LSTAT']],y_train)
# Prediction using Training Data
yout_train = lm.predict(x_train[['RM','LSTAT']])
print(yout_train[0:5])
# Prediction using Test Data
yout_test = lm.predict(x_test[['RM','LSTAT']])
print(yout_test[0:5])
print("\n")
# Model accuracy using Test Data
mse_test = mean_squared_error(y_test,yout_test)
print("Mean square error is",mse_test)
print("R-Square value using test data is", lm.score(x_test[['RM','LSTAT']],y_test))
print("\n")
# Model accuracy using Training Data
mse_train = mean_squared_error(y_train,yout_train)
print("Mean square error is",mse_train)
print("R-Square value using training data is", lm.score(x_train[['RM','LSTAT']],y_train))
Cross Validation :
Splitting the data into a single training and test set may cause a problem, especially if the data is sorted by some particular feature and our test data does not capture it. With Cross Validation, every subset of the data is used for both training and testing across iterations. This is just a brief introduction to the topic; there are other methods to perform cross validation.
In the example below, we are using the K-Fold validation technique, dividing our data into 5 folds so that each fold is used as test data once. We will then take the mean of the scores from all the iterations.
from sklearn.model_selection import cross_val_score, KFold
rcross = cross_val_score(lm, x_data, y_data, cv=KFold(n_splits=5,shuffle=True))
print(rcross)
print("The mean score across the folds is", rcross.mean())
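A related utility worth knowing is cross_val_predict, which returns the out-of-fold prediction for every row instead of a score; a minimal sketch:
from sklearn.model_selection import cross_val_predict
# Each row is predicted by a model that never saw that row during training
yhat_cv = cross_val_predict(lm, x_data, y_data, cv=5)
print(yhat_cv[0:5])
print("Cross-validated mean squared error is", mean_squared_error(y_data, yhat_cv))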