
You should learn this to become a good data scientist

Updated: Apr 15, 2020

In this article you will learn how to use R for exploring data with functions like summary(), head(), and more. Exploratory data analysis (EDA) is one of the most important initial steps after data cleaning, helping you understand the data you are working with. EDA will help you find errors, patterns, and more in your data. As Jeann Joroge puts it in his article "Significance of Exploratory Data Analysis", as long as there is data to analyze, the need to explore it is obvious.




So, in this article we are going to explore how data is explored😁😁.


After reading this article you will be able to understand the following concepts:

1. What is Exploratory Data Analysis?

2. Benefits of using Exploratory Data Analysis.

3. Working on the data.

4. Gathering information using different functions in R.


Before getting started, here are some articles you may be interested in reading.



1. What is Exploratory Data Analysis?



EDA is a technique for describing data by means of statistical and visualization techniques, bringing the important aspects of the data into focus for further analysis. It involves describing and summarizing your data without making assumptions about its contents. This is an essential step before machine learning or any statistical modelling, to make sure the data really is what it claims to be. EDA should be a standard part of how data science operates in organizations.

EDA is generally classified in two ways: first, graphical or non-graphical, and second, univariate or multivariate (usually bivariate). Graphical methods summarize the data diagrammatically, while non-graphical methods involve calculating summary statistics.



EDA Methods


1. Descriptive statistics (see the short R sketch after this list)

a. Central tendency of distribution (mean, median, mode)

b. Measures of variability (Standard Deviation, Variance, Quartiles)

c. Skewness and kurtosis (Measure of symmetry, measure of peakedness)


2. Visualization

a. 1-dimension

i. Histogram (Few data points)

ii. Density plot (Many data points)

b. 2-dimension

i. Scatterplot

c. 3-dimension

i. Bubble


3. Dimensionality reduction: reduces the number of variables to a few interpretable linear combinations, for a better understanding of the data.


4. Cluster analysis: organizes similar observations into distinct clusters.
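
To make the descriptive statistics above concrete, here is a minimal R sketch using base functions and the e1071 package; the vector x is just a stand-in for any numeric column.

library(e1071)   # skewness() and kurtosis()

x <- rnorm(1000)             # stand-in for any numeric column

mean(x); median(x)           # central tendency
sd(x); var(x); quantile(x)   # variability: standard deviation, variance, quartiles
skewness(x)                  # measure of symmetry
kurtosis(x)                  # measure of peakedness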


2. Why we use Exploratory Data Analysis

Correct data exploration helps a data scientist to:

1. Understand the relationships expected in the data, in order to plan the analysis technique.

2. Spot unexpected structure in the data that needs to be addressed, which may also suggest changes to the planned technique.

3. Deliver data-driven insights to the business, equipping them with accurate data instead of assumptions and helping them ask the right questions.

4. Provide context, so that the potential value of the data can be assessed and its output maximized.


The figure below shows the data science process and where EDA fits into it:

Data science process

3. Let's explore some data

In this article we are going to use the house prices data and perform a detailed EDA on it. So, let's start by importing the data into the R workspace.


You can download the data from here:


Data Description is given below:


library(data.table)
library(testthat)
library(gridExtra)
library(corrplot)
library(GGally)
library(ggplot2)
library(e1071)
library(dplyr)

Above are the supporting libraries you need to install for our EDA. You can learn more about ggplot2 in one of my other articles, data visualization in R.
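
If any of these packages are missing on your machine, they can be installed first; a minimal sketch:

# Install any of the packages above that are not yet present
pkgs <- c('data.table', 'testthat', 'gridExtra', 'corrplot',
          'GGally', 'ggplot2', 'e1071', 'dplyr')
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)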

Let's import the data.

train <- fread('G:/Blog/eda/train.csv',
               colClasses = c('MiscFeature' = 'character', 'PoolQC' = 'character', 'Alley' = 'character'))
cat_var <- names(train)[which(sapply(train, is.character))]
# Integer-coded variables that could also be treated as categorical
cat_var_ext <- c(cat_var, 'BedroomAbvGr', 'HalfBath', 'KitchenAbvGr',
                 'BsmtFullBath', 'BsmtHalfBath', 'MSSubClass')
numeric_var <- names(train)[which(sapply(train, is.numeric))]

Here we import the data with fread(), forcing a few sparse columns to be read as character so that each column has the class we need, and then split the column names into categorical and numeric variables.
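
Since testthat is loaded above, the class check can even be made an explicit assertion; a small sketch (not part of the original code):

# Assert that every column collected in cat_var really is character
expect_true(all(sapply(train[, .SD, .SDcols = cat_var], is.character)))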

Now,

Finding the structure of the data.

The dataset consists of 1460 rows and 81 columns.

dim(train)
[1] 1460   81
str(train)

Output:


The above output shows all the columns in the dataset with their data types. Now let's look at the first few entries of the table using head(); this function returns the first few rows of a dataset, and similarly tail() returns the last few rows.
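
For example, both functions accept an n argument to control how many rows are returned:

head(train, n = 3)   # first 3 rows
tail(train, n = 3)   # last 3 rows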


Summarizing the missing values in the data.

head(train)
colSums(sapply(train, is.na))
colSums(sapply(train[,.SD, .SDcols = cat_var], is.na))
colSums(sapply(train[,.SD, .SDcols = numeric_var], is.na))

Output:

Here we check whether our data contains any NA values. We can see that most houses have neither alley access nor a basement, which is why those columns contain so many NAs.
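
To see the worst-affected columns at a glance, the NA counts can also be sorted; a small sketch (not in the original code):

# Rank columns by number of missing values, largest first
na_counts <- colSums(is.na(train))
sort(na_counts[na_counts > 0], decreasing = TRUE)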

Now, let's get some insight into the number of houses that were remodeled.

sum(train[,'YearRemodAdd', with = FALSE] != train[,'YearBuilt', with = FALSE])

cat('Percentage of houses remodeled',sum(train[,'YearRemodAdd', with = FALSE] != train[,'YearBuilt', with = FALSE])/ dim(train)[1])


train %>% select(YearBuilt, YearRemodAdd) %>%
  mutate(Remodeled = as.integer(YearBuilt != YearRemodAdd)) %>%
  ggplot(aes(x = factor(Remodeled, labels = c('No', 'Yes')))) +
  geom_bar() + xlab('Remodeled') +
  theme_light()

Comparing YearBuilt against YearRemodAdd in our data, we find that 696 houses were remodeled and 764 were not.

Therefore the percentage of houses remodeled is about 0.4767, i.e. roughly 48%.

Output:

Summarizing the numeric variables and the structure of the data.

summary(train[,.SD, .SDcols =numeric_var])
cat('Train has', dim(train)[1], 'rows and', dim(train)[2], 'columns.')

Output:

Train has 1460 rows and 81 columns.

sum(is.na(train)) / (nrow(train) *ncol(train))
[1] 0.05889565
# Check for duplicated rows.

cat("The number of duplicated rows are", nrow(train) - nrow(unique(train)))
The number of duplicated rows are 0

# Convert character columns to factors
train[,(cat_var) := lapply(.SD, as.factor), .SDcols = cat_var]
train_cat <- train[,.SD, .SDcols = cat_var]      # categorical columns
train_cont <- train[,.SD,.SDcols = numeric_var]  # numeric columns

# Bar plot of the i-th column of data_in, with rotated x-axis labels
plotHist <- function(data_in, i) {
  data <- data.frame(x = data_in[[i]])
  p <- ggplot(data = data, aes(x = factor(x))) + stat_count() +
    xlab(colnames(data_in)[i]) + theme_light() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  return(p)
}

# Apply a plotting function to the columns indexed by ii and arrange
# the resulting plots in a grid with ncol columns
doPlots <- function(data_in, fun, ii, ncol = 3) {
  pp <- list()
  for (i in ii) {
    p <- fun(data_in = data_in, i = i)
    pp <- c(pp, list(p))
  }
  do.call("grid.arrange", c(pp, ncol = ncol))
}


# Density plot of the i-th column of data_in, annotated with its skewness
plotDen <- function(data_in, i) {
  data <- data.frame(x = data_in[[i]], SalePrice = data_in$SalePrice)
  p <- ggplot(data = data) +
    geom_line(aes(x = x), stat = 'density', size = 1, alpha = 1.0) +
    xlab(paste0(colnames(data_in)[i], '\n', 'Skewness: ',
                round(skewness(data_in[[i]], na.rm = TRUE), 2))) +
    theme_light()
  return(p)
}

Barplot for the categorical features

The bar plots below give some insight into the data: the MSZoning plot shows that the majority of the houses are located in low-density and medium-density residential zones.

doPlots(train_cat, fun = plotHist, ii = 1:4, ncol = 2)
doPlots(train_cat, fun = plotHist, ii = 4:8, ncol = 2)
doPlots(train_cat, fun = plotHist, ii = 8:12, ncol = 2)
doPlots(train_cat, fun = plotHist, ii = 13:18, ncol = 2)
doPlots(train_cat, fun = plotHist, ii = 18:22, ncol = 2)

Output:


train %>% select(LandSlope, Neighborhood, SalePrice) %>%
  filter(LandSlope %in% c('Sev', 'Mod')) %>% arrange(Neighborhood) %>%
  group_by(Neighborhood, LandSlope) %>% summarize(Count = n()) %>%
  ggplot(aes(Neighborhood, Count)) +
  geom_bar(aes(fill = LandSlope),
           position = 'dodge', stat = 'identity') +
  theme_light() + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Output:

The houses with a severe land slope are located in Clear Creek and Timberland, whereas houses with a moderate slope span more neighborhoods, among them Clear Creek and Crawford.

train %>% select(Neighborhood, SalePrice) %>% 
  ggplot(aes(factor(Neighborhood), SalePrice)) + 
  geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust =1)) + 
  xlab('Neighborhoods')

Output:

Plotting SalePrice by neighborhood in a boxplot shows that Brookside has cheap houses, while Northridge and Northridge Heights have higher prices, with several outliers.



Density plots for the numeric variables

The density plot of YearBuilt shows that there is a mix of old and new houses, and that fewer houses have been built recently, possibly due to the housing crisis.

doPlots(train_cont, fun = plotDen, ii = 2:6, ncol = 2)
doPlots(train_cont, fun = plotDen, ii = 7:12, ncol = 2)
doPlots(train_cont, fun = plotDen, ii = 13:17, ncol = 2)

Output:


The following histograms show that the majority of houses have 2 full baths and an average of 3 bedrooms.

doPlots(train_cont, fun = plotHist, ii = 18:23, ncol = 2)

Output:


Explore the correlations

correlations <- cor(na.omit(train_cont[,-1, with = FALSE]))

# correlations
row_indic <- apply(correlations, 1, function(x) sum(x > 0.3 | x < -0.3) > 1)

correlations<- correlations[row_indic ,row_indic ]
corrplot(correlations, method="square")

Output:


Scatter plot for variables that have high correlations.

The correlation matrix above shows that several variables are strongly correlated with the housing price.

Some of them are:

- OverallQual

- YearBuilt

- YearRemodAdd, and 13 more.

Looking at the correlations, several strong negative correlations can also be found.
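
One way to list these correlations explicitly is to sort the SalePrice column of the correlation matrix computed above; a small sketch:

# Correlation of each retained variable with SalePrice, strongest first
sale_corr <- correlations[, 'SalePrice']
sort(sale_corr[names(sale_corr) != 'SalePrice'], decreasing = TRUE)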

train %>% select(OverallCond, YearBuilt) %>% 
  ggplot(aes(factor(OverallCond),YearBuilt)) + 
  geom_boxplot() + 
  xlab('Overall Condition')

Output:

# Scatter plot of the i-th column against SalePrice with a linear fit,
# annotated with the correlation coefficient
plotCorr <- function(data_in, i) {
  data <- data.frame(x = data_in[[i]], SalePrice = data_in$SalePrice)
  p <- ggplot(data, aes(x = x, y = SalePrice)) +
    geom_point(shape = 1, na.rm = TRUE) +
    geom_smooth(method = lm) +
    xlab(paste0(colnames(data_in)[i], '\n', 'Correlation: ',
                round(cor(data_in[[i]], data$SalePrice, use = 'complete.obs'), 2))) +
    theme_light()
  return(suppressWarnings(p))
}

highcorr <- c(names(correlations[,'SalePrice'])[which(correlations[,'SalePrice'] > 0.5)],
              names(correlations[,'SalePrice'])[which(correlations[,'SalePrice'] < -0.2)])

data_corr <- train[,highcorr, with = FALSE]
doPlots(data_corr, fun = plotCorr, ii = 1:6)
doPlots(data_corr, fun = plotCorr, ii = 6:11)

Output:


The histogram below for the variable SalePrice shows that it is skewed (its distribution is asymmetric, with a long tail to the right).
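
The code below also normalizes the distribution with a log transform. To quantify its effect, you can compare the skewness before and after (e1071 is loaded above):

skewness(train$SalePrice, na.rm = TRUE)           # well above 0: long right tail
skewness(log(train$SalePrice + 1), na.rm = TRUE)  # much closer to 0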

library(scales)
ggplot(train, aes(x=SalePrice)) + geom_histogram(col = 'white') + theme_light() +scale_x_continuous(labels = comma)
summary(train[,.(SalePrice)])
# Normalize distribution
ggplot(train, aes(x=log(SalePrice+1))) + 
  geom_histogram(col = 'white') + 
  theme_light()

Output:


Code: You can checkout the whole R code for this article here.



Conclusion:

In this article we learned what EDA is, how it is classified, and why we use it, and we performed a detailed Exploratory Data Analysis on the house prices data. We also learned various methods for performing EDA in R.


Thanks for reading this article! I hope you have learned something new and will share this post with your friends and family who are interested in learning data analysis in R 😊😊. More R programming basics are on their way, so stay tuned. Love you all.



Check out my website and other articles at mzeeej.com/blogs 😁😁.


