
How data cleaning can get you your heart's desire

Updated: Apr 17, 2020

Data cleaning is the process of transforming raw data into statistically correct data that can be further analyzed to extract proper, sensible and valuable information. The data given to a data analyst often contains many errors, and without correcting those errors the analysis cannot be accurate, which makes data cleaning one of the most crucial steps in data analysis.

Many tools and technologies have been introduced since data analysis came into the market, but among all of them R seems to be one of the easier and more accurate options. With the start of 2020 I had the opportunity to learn R, and it felt so easy, as if I had known it all my life. Then again, I already knew SQL and a few other languages, so my brain is used to picking these things up; still, I can say that any newcomer to data analysis can learn R faster than most other languages. Putting all that aside, let's get back to data cleaning.

Data cleaning is the third step in the process of data analysis:

Step 1: Defining the need of data analysis.

Step 2: Collecting the data from sources.

Step 3: Cleaning Data.

Step 4: Analyzing the Data.

Step 5: Interpreting and presenting the results.

So, now that we know the purpose of data cleaning, let's dive into the basic steps of cleaning data in R:

Step 1: Initial Exploratory analysis

Step 2: Visual Exploratory analysis

Step 3: Cleaning the errors

Step 1: Initial Exploratory analysis

The first step in data cleaning is the initial exploratory analysis, where we check whether the imported data is in data frame format and, if not, convert it into a data frame (a small sketch of this conversion appears after the class check below). It is also very important to know how to import data into R.

Here, I have taken a Superstore data-set which consists of 49 rows and 21 columns, with numeric, integer and character values, plus some errors that need to be corrected during cleaning.

You can download a data-set for practice from Kaggle and many other sites that provide open datasets.


#set the working directory to the folder containing the raw data
setwd("F:/Blog/Cleaning data using R") 
#list the files in the working directory 
dir()
#read the CSV file in the directory into a data frame
data<-read.csv("superstore.csv") 
#view the imported data
View(data)

Output:

After importing the data, the first thing we should check is the class of our data object.

class(data)

With this syntax we can check whether our data is stored in data frame format or not.

[1] "data.frame"

Now we can check how many rows and columns the data frame has by using the following syntax.

dim(data)
[1] 49 21

Here, we can see that our data frame has 49 rows and 21 columns.

Now, to check the summary statistics of our data we can use the following:

summary(data)

Output:


Now that we have seen our data is in a proper data frame format with the expected rows and columns, and we have the summary statistics, let's move to the second step, that is, visual exploratory analysis.

Step 2: Visual Exploratory analysis

There are two types of plots we should use during the cleaning process:

1. Histogram

2. BoxPlot


1. Histogram

A histogram is very useful for visualizing the distribution of numeric columns. We can determine whether the distribution of the data is normal, bimodal, uniform or of any other kind. We can also use a histogram to determine whether there is an outlier in a particular numerical column. We can draw a histogram of a numerical column by using the following snippet.


install.packages("plyr")
library(plyr)

first we will install a package name "plyr" for drawing histogram of our numerical column


hist(data$Sales)

Similarly, we can check the other numerical columns in our data frame with histograms.

Output:
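
For instance, assuming this particular Superstore file also has numeric Profit and Quantity columns (an assumption, so check your own column names), the same one-liner works for them:

hist(data$Profit)    #distribution of the Profit column (assumed to exist)
hist(data$Quantity)  #distribution of the Quantity column (assumed to exist)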

2. BoxPlot

This plot is of great use because it shows the median along with the first and third quartiles (the median itself is the second quartile). A boxplot is the best way to spot outliers in our data frame. We can draw a boxplot by using the following snippet.


boxplot(data$Sales)

Similarly, for the other numerical columns in our data frame.

Output:
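
As with the histograms, and again assuming Profit and Quantity columns exist in this file, the outlier check can be repeated per column:

boxplot(data$Profit)    #spot outliers in the Profit column (assumed to exist)
boxplot(data$Quantity)  #spot outliers in the Quantity column (assumed to exist)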

Now that we have checked the distributions and outliers in our data, we can move toward correcting the errors in our data frame, which brings us to the final step of data cleaning, i.e. cleaning the errors.

Step 3: Cleaning the errors

Before starting with error correction, we first need to understand what kind of impurities the data has; only then can we start applying methods to clean those impurities one by one.


Numerous mistakes creep in while collecting data from sensors or other sources: a column name may be missing or wrong, a data type may not be the expected integer or character, values may be missing, and many more. The following are some of the common ones:

1. Column names are incorrect (a renaming sketch follows below)

2. Columns have an incorrect type associated (e.g. a column containing text stored as a numeric column)

3. Values are in inconsistent uppercase or lowercase

4. Unwanted whitespace

5. Wrong values

6. Outliers

7. Missing values, and many more


Now that we know what kind of errors we can find in a data frame, we will try to resolve those errors one by one.
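
The numbered fixes below cover most of the errors listed above; renaming an incorrect column (error 1) is not shown there, so here is a minimal sketch using base R, where the misspelled old name is purely a hypothetical placeholder:

#rename an incorrectly named column (the old name "Custmer.Name" is a hypothetical placeholder)
names(data)[names(data) == "Custmer.Name"] <- "Customer.Name"
names(data) #confirm the corrected column names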


1. My data has a column whose integer values are stored in character format; let's make the correction here.

class(data$Row.ID)
[1] "character"

After correction:

data$Row.ID<-as.numeric(data$Row.ID)
class(data$Row.ID)
[1] "numeric"

Output:

Similarly, we can correct any other incorrectly typed column with this snippet.

There are many type conversions we can carry out in R, such as as.numeric(), as.character() and as.factor().
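
As a small, hedged illustration of as.factor(), the Category column (which our data frame does contain) could be stored as a factor in a separate variable, without overwriting the original column:

category_factor <- as.factor(data$Category) #categorical column stored as a factor
class(category_factor)                      #"factor"
levels(category_factor)                     #the distinct category labels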


2. Now let's change the elements of lowercase columns into uppercase.

In our data frame the customer names are in lowercase, so let's convert them to uppercase.


View(data$Customer.Name)
data$Customer.Name<-toupper(data$Customer.Name)

Similarly, we can change any other lowercase column to uppercase, or vice versa.
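
For the reverse direction, base R provides tolower(); a quick sketch on a copy of the same column, so the uppercase fix above is left untouched:

lower_names <- tolower(data$Customer.Name) #lowercase copy of the customer names
head(lower_names)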


3. Now let's trim the whitespace in specific columns of our dataset.

Before trimming the whitespace we need to install a package named "stringr".


install.packages("stringr")
library(stringr)

We remove the whitespace from our Country column using the following snippet.


data$Country<-str_trim(data$Country)

4. Now we will replace "Standard Class" with "General Class" in the Ship.Mode column, which was a mistake in data entry.


data$Ship.Mode<-str_replace(data$Ship.Mode,"Standard Class","General Class")
View(data$Ship.Mode)

Output:

Similarly, we can replace any other values in the data frame.
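
One caveat: str_replace() changes only the first match within each string. To replace every occurrence, stringr also provides str_replace_all(); a hedged sketch, where the Order.ID column and the hyphen pattern are only assumptions about this file:

order_ids <- str_replace_all(data$Order.ID, "-", "/") #replace every hyphen (column name is an assumption)
head(order_ids)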


5. Replacing the outliers of a particular column with the median.


vec1<-boxplot.stats(data$Sales)$out #outlier values flagged by the boxplot
data$Sales[data$Sales %in% vec1]<-median(data$Sales) #replace them with the column median

Here, we took the Sales column and replaced its outliers with the column median.

Output:



6. Now, let's deal with the missing values in our data frame.

We will check for missing values in the data frame with the following snippet. Note that na.omit() and replacing NA with 0 are two alternative fixes; normally you would pick one of them.

any(is.na(data)) #checking for missing values in the entire data frame
[1] TRUE
sum(is.na(data)) #checking the total number of missing values in the data frame
[1] 1
sum(is.na(data$Discount)) #checking missing values in a particular column
[1] 1
data<-na.omit(data) #option 1: eliminating rows with missing values from the data frame
data[is.na(data)]<- 0  #option 2: replacing NA with 0 in the entire data frame

Output:
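
As an alternative to replacing every NA with 0, a numeric column's missing values can be imputed with a central value instead; a minimal sketch for the Discount column, which we already know held the missing value:

data$Discount[is.na(data$Discount)] <- mean(data$Discount, na.rm = TRUE) #impute with the column mean
sum(is.na(data$Discount)) #should now print 0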

7. Our data frame has two related columns which we can merge together into a single column.

But before doing so, we have to install a package named "tidyr".


install.packages("tidyr")
library(tidyr)

data1<-unite(data = data,col = Categorywithsub.category,Category,Sub.Category)

View(data1)

Output:

The unite() function takes four arguments here: the data frame, the name of the new column, and the two columns that we want to unite.
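
By default unite() joins the values with an underscore; the optional sep argument controls the separator, and remove = FALSE keeps the original columns as well. For example (data2 is just a new name for the result):

data2<-unite(data = data,col = Categorywithsub.category,Category,Sub.Category,sep = " - ",remove = FALSE)
View(data2)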

Steps 1 to 3 will give you a clean data-set. Always explore new and more efficient ways to clean data, and never stop exploring. Thank you 😁

