Multilinear Regression with Python
A multilinear (multiple linear) regression model is based on two or more independent variables. These variables are used to predict the outcome of the dependent variable.
I will build a multilinear regression model with Python in a few simple steps. First, here are the essential libraries for the regression analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline
The next step is to import the data using the pandas library. I will import house price data, which will be used to predict house prices based on various factors (the independent variables). The data can be downloaded from the link. It is stored in an Excel file, which pandas can read directly.
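As a minimal sketch of the import step: the file name below is hypothetical (I am assuming the download is saved next to the notebook as an .xlsx file), and load_house_data is just an illustrative helper name. Note that pd.read_excel needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd

# Hypothetical file name: replace it with the name of the file you downloaded.
DATA_PATH = "house_prices.xlsx"


def load_house_data(path=DATA_PATH):
    """Read the first sheet of the Excel file into a DataFrame."""
    return pd.read_excel(path)


# Usage, once the file is in place:
# df = load_house_data()
# df.head()
```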
Let’s analyse the type of each variable along with the number of entries (rows) using the info() function. There are seven columns, each with 414 entries (rows).
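Since the real file is not bundled here, the sketch below calls info() on a synthetic stand-in. The column names and the random values are my assumptions, and the stand-in has only four columns rather than the seven in the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the house price data (column names are assumed).
rng = np.random.default_rng(0)
n = 414
df = pd.DataFrame({
    "house_age": rng.uniform(0, 40, n),
    "distance_to_mrt": rng.uniform(20, 6000, n),
    "convenience_stores": rng.integers(0, 11, n),
    "price_per_unit_area": rng.uniform(10, 80, n),
})

df.info()  # prints column names, non-null counts and dtypes
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```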
In the following step, we check for missing values with isnull().any(). It shows that there are no missing values in the data set.
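To show what the check reports, here is a small sketch on two tiny frames, one complete and one with a deliberately inserted gap. The values and column names are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frames (values invented for the example).
clean = pd.DataFrame({
    "house_age": [5.0, 12.5, 30.1],
    "price_per_unit_area": [42.0, 37.9, 25.3],
})
with_gap = clean.copy()
with_gap.loc[1, "house_age"] = np.nan  # insert one missing value

print(clean.isnull().any())     # one boolean per column
print(with_gap.isnull().any())  # flags the column that has a gap
print(with_gap.isnull().sum())  # number of missing cells per column
```

isnull().any() only says whether a column has gaps; isnull().sum() is handy when you also want to know how many.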
Let’s compute some summary statistics to understand the data. The describe() function reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.
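A short sketch of describe() on a tiny invented frame (the numbers are not from the house price data):

```python
import pandas as pd

# Tiny illustrative frame (values invented for the example).
df = pd.DataFrame({
    "house_age": [5.0, 12.5, 30.1, 8.0],
    "price_per_unit_area": [42.0, 37.9, 25.3, 40.0],
})

summary = df.describe()  # one column of statistics per numeric column
print(summary)
print(summary.loc["mean", "house_age"])  # pick out a single statistic
```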
Let’s analyse how the variables relate to each other using correlation. A heatmap is used to visualise the correlation of each variable with the others.
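A correlation heatmap can be sketched as follows. The data here is synthetic, with assumed column names and an invented relationship between the variables, so the signs and sizes of the correlations it shows say nothing about the real house price data.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in (column names and the price formula are assumptions).
rng = np.random.default_rng(0)
n = 414
df = pd.DataFrame({
    "house_age": rng.uniform(0, 40, n),
    "distance_to_mrt": rng.uniform(20, 6000, n),
    "convenience_stores": rng.integers(0, 11, n),
})
df["price_per_unit_area"] = (
    40
    - 0.2 * df["house_age"]
    - 0.004 * df["distance_to_mrt"]
    + 1.2 * df["convenience_stores"]
    + rng.normal(0, 5, n)
)

corr = df.corr()  # pairwise Pearson correlation matrix
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Fixing vmin and vmax at -1 and 1 keeps the colour scale symmetric, so positive and negative correlations of the same strength get equally strong colours.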
From the above figure, the age of the house, the distance to the nearest MRT station, and the number of convenience stores are each positively correlated with the house price per unit area.
Let’s separate the independent and dependent variables. The dependent variable (Y) is the house price per unit area, whereas the independent variables are the age of the house, the distance to the nearest MRT station, and the number of convenience stores.
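Selecting the columns might look like this. The column names are my assumption (the real sheet may label them differently), and the four rows are invented.

```python
import pandas as pd

# Tiny illustrative frame (column names assumed, values invented).
df = pd.DataFrame({
    "house_age": [5.0, 12.5, 30.1, 8.0],
    "distance_to_mrt": [390.6, 2175.0, 4082.0, 104.8],
    "convenience_stores": [5, 3, 0, 9],
    "price_per_unit_area": [42.0, 37.9, 25.3, 40.0],
})

feature_cols = ["house_age", "distance_to_mrt", "convenience_stores"]
X = df[feature_cols]            # independent variables (2-D DataFrame)
y = df["price_per_unit_area"]   # dependent variable (1-D Series)
print(X.shape, y.shape)
```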
The next step is to split the data into training and test sets. Setting test_size = 0.2 means that 80% of the data is used for training and 20% for testing. After splitting, the model is fitted on the training data.
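Here is a sketch of the split and fit on synthetic data. The column names, the formula that generates y, and random_state=0 (added only to make the split reproducible) are my assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 414 rows (names and formula are assumptions).
rng = np.random.default_rng(0)
n = 414
X = pd.DataFrame({
    "house_age": rng.uniform(0, 40, n),
    "convenience_stores": rng.integers(0, 11, n),
})
y = 30 - 0.25 * X["house_age"] + 1.5 * X["convenience_stores"] + rng.normal(0, 1, n)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)  # fit on the training rows only
print(X_train.shape, X_test.shape)
```

With 414 rows the 20% share is rounded up, so 83 rows go to the test set and 331 to training.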
Now it is time to compute the results of the regression model. The first and most important step is to evaluate the coefficient of determination (R²), along with the coefficient and intercept values.
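The sketch below reads these values off a fitted model. The data is again synthetic: I generate y from known coefficients (assumed values -0.25 and 1.5 with intercept 30) so you can see the fit recover them approximately.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in (names and the generating formula are assumptions).
rng = np.random.default_rng(0)
n = 414
X = pd.DataFrame({
    "house_age": rng.uniform(0, 40, n),
    "convenience_stores": rng.integers(0, 11, n),
})
y = 30 - 0.25 * X["house_age"] + 1.5 * X["convenience_stores"] + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# R² on the held-out rows, then the fitted parameters.
r2 = metrics.r2_score(y_test, model.predict(X_test))
print("R2:", r2)
print("Intercept:", model.intercept_)
print(pd.Series(model.coef_, index=X.columns, name="coefficient"))
```

model.score(X_test, y_test) returns the same R² value, so either call works.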
Now we can predict y values for the test data with the fitted model. The actual and predicted values are then placed side by side in a dataframe for comparison.
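A sketch of the prediction and comparison step, once more on synthetic data with assumed names. The "Actual" and "Predicted" column labels are my choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in (names and the generating formula are assumptions).
rng = np.random.default_rng(0)
n = 414
X = pd.DataFrame({
    "house_age": rng.uniform(0, 40, n),
    "convenience_stores": rng.integers(0, 11, n),
})
y = 30 - 0.25 * X["house_age"] + 1.5 * X["convenience_stores"] + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)  # NumPy array, one prediction per test row

# Put actual and predicted values side by side.
comparison = pd.DataFrame({
    "Actual": y_test.to_numpy(),
    "Predicted": y_pred,
})
print(comparison.head())
```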
Let’s visualise the predicted and actual values in a graph. A line graph works well for this purpose.
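One way to draw such a line graph is sketched below. The handful of actual and predicted values are invented for illustration; in practice you would plot the columns of your own comparison dataframe.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values standing in for a real comparison dataframe.
comparison = pd.DataFrame({
    "Actual": [42.0, 37.9, 25.3, 40.0, 47.3, 54.8],
    "Predicted": [44.1, 35.2, 27.8, 41.5, 45.0, 50.9],
})

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(comparison.index, comparison["Actual"], marker="o", label="Actual")
ax.plot(comparison.index, comparison["Predicted"], marker="x", label="Predicted")
ax.set_xlabel("Test sample")
ax.set_ylabel("House price per unit area")
ax.legend()
```

If the test set is large, plotting only the first few dozen rows (for example comparison.head(25)) keeps the two lines readable.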
Feel free to discuss…!
Good Luck!