Handling Missing Values in Pandas
Most of the time, the dataframes we work with contain missing values. Like other statistical and programming tools, Python has a library, Pandas, for handling such data so that computations run smoothly. In a Pandas dataframe, missing values are generally represented by NaN (Not a Number). Let’s apply some Pandas techniques for dealing with missing values. First, we need to import the essential libraries and create a dataframe for demonstration.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Now create a dataframe as:
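The original dataframe is not reproduced here, so the following is only a minimal sketch with made-up values. The column names A, B, and C and every number in it are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data (not the article's original values).
# np.nan marks the missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})
print(df)
```

Columns A and B each contain two missing values, while column C is complete, which makes it easy to see what each method does.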
The first step is to identify the missing values. There are various ways to do this; let’s try a few of them. The first is to apply isnull(), which returns True where a value is missing and False where it is present.
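As a small sketch of this step, isnull() can be applied to a hypothetical dataframe (the values below are assumed for illustration, not taken from the original post):

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Boolean dataframe of the same shape: True where a value is missing.
mask = df.isnull()
print(mask)
```

isna() is an alias of isnull() and gives the same result.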
The following command sums up the missing values in each column. The total number of missing values is shown next to each column name.
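A minimal sketch of the per-column count, again on assumed demonstration values:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# True counts as 1 when summed, so this gives the number of
# missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

For this data the counts are 2 for A, 2 for B, and 0 for C.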
Let’s consider methods for dealing with these missing values. The following methods are covered in turn.
- Drop missing values
- Fill missing values
- Fill with backward and forward methods
- Interpolation
The first method is to drop all rows with missing values. Simply apply dropna() to the dataframe; the outcome looks like this:
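A short sketch of dropna() with its default settings, using assumed demonstration values:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Drop every row that contains at least one missing value.
# dropna() returns a new dataframe; df itself is left unchanged.
rows_dropped = df.dropna()
print(rows_dropped)
```

Only the row with index 2 survives here, because it is the only row without a NaN in any column.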
Furthermore, dropna() can delete along either axis. To drop rows that contain missing values, set axis=0 (the default); to drop columns that contain missing values, set axis=1.
This is a simple way to deal with missing values. However, dropping rows or columns can discard useful information, so it is often not an appropriate method. Let’s consider other approaches.
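The two axis settings can be compared in a small sketch on assumed demonstration values:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# axis=0 (the default): drop rows containing any missing value.
rows_dropped = df.dropna(axis=0)

# axis=1: drop columns containing any missing value.
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```

With axis=1 only column C remains, since A and B both contain missing values.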
Fill Missing Values
The fillna() method is used to fill missing values. With it, we can fill missing values with the mean, median, mode, or any other relevant value. Let’s take a single column and fill its missing values with zero.
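A minimal sketch of filling one column with zero. The column name A and its values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
})

# Replace every missing value in column A with 0.
# fillna() returns a new Series; assign it back to keep the result.
a_filled = df["A"].fillna(0)
print(a_filled)
```

To store the result in the dataframe, assign it back with df["A"] = df["A"].fillna(0).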
Similarly, we can fill missing values with any other number by putting it in place of 0. Moreover, we can also fill with objects rather than numbers.
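As a sketch of both variants, the example below fills with a different number and with a text label. The value -1 and the label "missing" are arbitrary assumptions, and the original post may have used a different object.

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration column; np.nan marks missing entries.
a = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0], name="A")

# Fill with an arbitrary number instead of 0.
a_number = a.fillna(-1)

# Fill with a text label; the column then holds mixed
# floats and strings, so it is no longer purely numeric.
a_label = a.fillna("missing")

print(a_number)
print(a_label)
```

Filling a numeric column with a label makes the gaps visible but prevents numeric operations such as mean() on that column, so it suits display more than computation.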
Let’s compute the mean and median of the data set, and then fill the missing values with the mean and the median respectively.
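A short sketch of the column means on assumed demonstration values:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# mean() skips NaN by default, so each mean is computed
# from the available values only.
column_means = df.mean()
print(column_means)
```

For this data the means are 4.0 for A, 3.0 for B, and 30.0 for C.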
The mean of each column can be read from the output above, and the median of each column from the output below.
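The medians can be computed the same way; the values here are again assumed for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
    "C": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# median() also skips NaN by default.
column_medians = df.median()
print(column_medians)
```

For this data the medians are 3.0 for A, 3.0 for B, and 30.0 for C. Column A shows how an outlying value (8.0) pulls the mean above the median.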
If we want to fill missing values with the mean, we can simply pass df["column"].mean() to fillna(). The result is shown below. In the same way, we can fill with the median.
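A minimal sketch of mean and median filling on a single column, with assumed values:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
})

# Fill the gaps in column A with that column's mean (4.0 here).
a_mean_filled = df["A"].fillna(df["A"].mean())

# Fill the gaps in column A with that column's median (3.0 here).
a_median_filled = df["A"].fillna(df["A"].median())

print(a_mean_filled)
print(a_median_filled)
```

To fill every column with its own mean in one step, df.fillna(df.mean()) works as well, because fillna() matches the Series of means to the columns by name.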
Fill with Backward and Forward Method
Backward and forward filling are two further approaches to dealing with missing values. In the backward method, the next valid value is propagated backward; for that, pass method="bfill" to fillna(). In the forward method, the previous valid value is propagated forward; for that, pass method="ffill" to fillna(). Note that pandas 2.1 and later deprecate the method argument of fillna() in favour of the equivalent bfill() and ffill() methods. The following figure shows the backward fill.
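The sketch below shows both directions on assumed demonstration values. It calls the dedicated ffill() and bfill() methods, which give the same result as fillna(method="ffill") and fillna(method="bfill") and also work on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration data; np.nan marks missing entries.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 8.0],
    "B": [np.nan, 2.0, 3.0, 4.0, np.nan],
})

# Forward fill: each gap takes the last valid value above it.
forward = df.ffill()

# Backward fill: each gap takes the next valid value below it.
backward = df.bfill()

print(forward)
print(backward)
```

A gap with no valid value on the relevant side stays missing: the first entry of B remains NaN after a forward fill, and the last entry of B remains NaN after a backward fill.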
Interpolation
Interpolation is the last technique covered here for dealing with missing data. It estimates a missing value from the values around it. I will not go into the details of the interpolation process here, but you can follow the link for details regarding interpolation. In the following figure, we have applied the linear method of interpolation.
This method can be examined in the following figure, which compares column A of the dataframe before and after interpolation. The left panel shows column A before interpolation, and the right panel shows it after linear interpolation. The missing values have clearly been filled with estimated values.
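A minimal sketch of linear interpolation on a single column, with values assumed for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical demonstration column; np.nan marks missing entries.
a = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0], name="A")

# Linear interpolation places each missing value on the straight
# line between its nearest valid neighbours, treating the rows
# as equally spaced.
a_interpolated = a.interpolate(method="linear")
print(a_interpolated)
```

Here the gap between 1.0 and 3.0 becomes 2.0, and the gap between 3.0 and 8.0 becomes 5.5.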
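A before-and-after comparison of this kind could be drawn roughly as follows. This is a sketch only: the data, panel titles, and figure size are assumptions, and the non-interactive Agg backend is chosen just so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumed backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical demonstration column; np.nan marks missing entries.
before = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0], name="A")
after = before.interpolate(method="linear")

# Two panels side by side with a shared y-axis for easy comparison.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
before.plot(ax=ax_left, marker="o", title="Before interpolation")
after.plot(ax=ax_right, marker="o", title="After linear interpolation")
fig.tight_layout()
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to display the figure. The left panel shows isolated points where data is missing around them, while the right panel shows one continuous line.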
Feel free to discuss…!
Good Luck!