
Do data visualization the right way, make it cool

Updated: Apr 17, 2020

In data science, analyzing the data is not enough. A good data scientist should also be able to build a compelling visualization that shows the insights and helps others understand the result. In this article you'll learn how to explore your data visually and present it to others using ggplot2, a popular data visualization library in R. We'll use bar plots, histograms, box plots and density plots to understand the data.

By the end of this article you'll have learned the following:

  • What is data visualization?

  • Why is it used?

  • Visualization in R using ggplot2 library

Hello guys, this is Zeeshan. In my last post I covered a crucial part of data analysis in R: "Data cleaning in R". R is an amazing data analysis platform with a wide scope in the coming technology era, where data science is one of the most important skills organizations need to keep their business growing. Whether it is for customer relationships or any other need, data analysis has proved to be one of the most valuable tools for businesses. A few days back I also posted an article listing some recommended books and communities for newcomers to R programming, along with its uses and "why R programming is so important today in 2020".


What is data visualization?


Data visualization is the graphical representation of data using scatter plots, charts, graphs, histograms, maps and so on. Visualization makes our data more understandable: it helps us see patterns and trends, and it lets us convey information faster and in a visual way.

It is easier for us to understand and retain information in picture form. Data visualization therefore helps us understand the different variables and their patterns, and draw conclusions from our data.


R is a great platform for data visualization. In R, data visualization can be performed in the following ways:

  1. Base Graphics

  2. Grid Graphics

  3. Lattice Graphics

  4. ggplot2

ggplot2 is the most popular of these and is used by many researchers and scientists, so in this blog I'll show you what ggplot2 can do and why it is such a popular choice for data visualization in R.

Data visualization in R using ggplot2

ggplot2 is based on the grammar of graphics, a set of rules for building graphs out of independent components.

A ggplot2 plot consists of the following:

  1. Data

  2. Layers

  3. Scales

  4. Coordinates

  5. Faceting

  6. Themes

ggplot2 is one of the most sophisticated packages in R for data visualization, and it helps create beautiful, print-quality plots with simple changes. It makes it very easy to create single-variable or multi-variable graphs in R.


The three most important components for building a ggplot are:

  1. Data: the dataset to be plotted.

  2. Aesthetics: the mapping of variables in the data to visual properties such as position, color and fill.

  3. Layers: the visual elements (geoms) drawn on the plot.

The general syntax of ggplot is:

ggplot(data=NULL, mapping=aes()) + geom_function()

With the following commands we can install the ggplot2 package and load it into the workspace:

install.packages("ggplot2")
library(ggplot2)
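
With the package loaded, the general syntax can be tried on a tiny example. This is only a minimal sketch: the data frame toy and its columns group and value are made up for illustration and are not part of the Titanic data used later.

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  group = c("a", "b", "c"),
  value = c(3, 7, 5)
)

# Data + aesthetic mapping + one geom layer
p <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_col()

print(p)  # draws one bar per group
```

Storing the plot in a variable such as p is optional, but it makes it easy to add more layers later with +.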

So, let's use ggplot2 to visualize a data set. In this blog I'm going to use the Titanic data set from Kaggle to demonstrate ggplot2 and how it makes results easier to understand.


So let's import the data into R:

titanic <- read.csv("train.csv", stringsAsFactors = FALSE)
View(titanic) #display the table

Output:

names(titanic) #show all the column names in the table
str(titanic) #show the structure of the table

Output:

Now setting the factors:

Sometimes variables in a dataset are categorical but get interpreted as numeric. For example, in the output above we can see that Pclass holds the levels 1, 2 and 3 for the ticket class. We don't want it interpreted as numeric, so we convert it to a factor using the function as.factor().

# Set up factors.
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Survived <- as.factor(titanic$Survived)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
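
To see what the conversion does, here is a small sketch on a made-up data frame. The name mini and its values are hypothetical and only mimic the Pclass and Survived columns. The labels "No" and "Yes" are my own choice for illustration and are not used in the rest of this post.

```r
# Hypothetical mini version of two Titanic columns, made up for illustration
mini <- data.frame(
  Pclass   = c(1, 3, 2, 3),
  Survived = c(1, 0, 1, 0)
)

class(mini$Pclass)    # numeric before conversion

mini$Pclass <- as.factor(mini$Pclass)
levels(mini$Pclass)   # the distinct classes, now treated as categories

# factor() additionally lets us attach readable labels to the codes
mini$Survived <- factor(mini$Survived, levels = c(0, 1),
                        labels = c("No", "Yes"))
table(mini$Survived)  # counts per category
```

Once a column is a factor, ggplot2 treats it as discrete, which is what makes fill = Survived produce separate colors instead of a continuous color scale.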

Now let's start plotting.


1. What was the survival rate on the Titanic?

ggplot(titanic, aes(x = Survived)) + 
theme_bw() +
geom_bar() +
labs(y = "Passenger Count", title = "Titanic Survival Rates")

ggplot2 provides many functions that build up a plot and make it look more elegant.

Functions like:

  1. aes() : It describes which variables (columns) in the data are mapped to which visual properties of the graph, such as x, y and fill.

  2. theme_bw() : It applies a complete black-and-white theme to the plot. Individual theme elements can be adjusted with theme().

  3. geom_bar() : It plots a bar chart of counts. There are several more geoms in ggplot2, like geom_histogram(), geom_point() etc.

  4. labs() : This function is used to label the plot (title and axis labels).

Output:

2. What was the survival rate by gender?

ggplot(titanic, aes(x = Sex, fill = Survived)) + 
theme_bw() +
geom_bar() +
labs(y = "Passenger Count",
title = "Titanic Survival Rates by Sex")

Output:
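
The bar charts in this post show passenger counts. If you want to compare rates rather than counts, one option is position = "fill" in geom_bar(), which stretches every bar to the same height. The sketch below uses a made-up data frame toy that only mimics the Sex and Survived columns. With the real data you would pass titanic instead.

```r
library(ggplot2)

# Hypothetical toy rows that mimic the Sex and Survived columns
toy <- data.frame(
  Sex      = factor(c("female", "female", "female", "male", "male", "male", "male")),
  Survived = factor(c(1, 1, 0, 0, 0, 0, 1))
)

# position = "fill" stacks each bar to height 1, so the y axis is a proportion
p_rate <- ggplot(toy, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  geom_bar(position = "fill") +
  labs(y = "Proportion of Passengers",
       title = "Survival Proportion by Sex (toy data)")

print(p_rate)
```

This makes groups of very different sizes easier to compare, at the cost of hiding how many passengers are in each group.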

3. What was the survival rate by class of ticket?

ggplot(titanic, aes(x = Pclass, fill = Survived)) + 
theme_bw() +
geom_bar() +
labs(y = "Passenger Count", title = "Titanic Survival Rates by Pclass")

Output:

4. What was the survival rate by class of ticket and gender?

ggplot(titanic, aes(x = Sex, fill = Survived)) + 
theme_bw() +
facet_wrap(~ Pclass) +
geom_bar() +
labs(y = "Passenger Count", title = "Titanic Survival Rates by Pclass and Sex")

Here we have used a new function:

facet_wrap() : It splits the plot into multiple panels, one for each value of the given variable (here Pclass).

Output:
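
For two faceting variables, facet_grid() is an alternative that lays the panels out as rows and columns. The sketch below compares the two on a made-up data frame. The name toy and its values are hypothetical and only mimic the Titanic columns.

```r
library(ggplot2)

# Hypothetical toy rows covering every Sex / Pclass combination once
toy <- data.frame(
  Sex      = factor(c("female", "male", "male", "female", "male", "female")),
  Pclass   = factor(c(1, 1, 2, 2, 3, 3)),
  Survived = factor(c(1, 0, 0, 1, 0, 0))
)

# One panel per ticket class
p_wrap <- ggplot(toy, aes(x = Sex, fill = Survived)) +
  geom_bar() +
  facet_wrap(~ Pclass)

# Rows for Sex, columns for Pclass: a 2 x 3 grid of panels
p_grid <- ggplot(toy, aes(x = Survived)) +
  geom_bar() +
  facet_grid(Sex ~ Pclass)

print(p_wrap)
print(p_grid)
```

facet_wrap() wraps a sequence of panels into a compact layout, while facet_grid() keeps one variable per direction, which can make row and column comparisons easier to read.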

5. What is the distribution of passenger ages?

ggplot(titanic, aes(x = Age)) +
theme_bw() +
geom_histogram(binwidth = 5) +
labs(y = "Passenger Count", x = "Age (binwidth = 5)", title = "Titanic Age Distribution")

Here, instead of a bar chart we are using a histogram, since it is more effective for continuous numerical values.

Output:
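
The Age column in the Kaggle file has missing values, and ggplot2 drops them with a warning. A small sketch of handling this explicitly is shown below. The data frame toy and its ages are made up for illustration, and boundary = 0 is an optional choice of mine so that the bins start at 0, 5, 10 and so on.

```r
library(ggplot2)

# Hypothetical ages, including missing values as in the real Age column
toy <- data.frame(Age = c(2, 4, 18, 22, 25, 29, 31, 35, 40, 58, NA, NA))

sum(is.na(toy$Age))  # how many ages are missing

# Drop the missing ages explicitly, then bin the rest in 5-year steps
p_age <- ggplot(subset(toy, !is.na(Age)), aes(x = Age)) +
  theme_bw() +
  geom_histogram(binwidth = 5, boundary = 0) +
  labs(y = "Passenger Count", x = "Age (binwidth = 5)")

print(p_age)
```

Filtering first keeps the warning out of your console and makes it clear to readers that passengers without a recorded age are not in the chart.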

6. What are the survival rates by age?

#histogram
ggplot(titanic, aes(x = Age, fill = Survived)) +
theme_bw() +
geom_histogram(binwidth = 5) +
labs(y = "Passenger Count", x = "Age (binwidth = 5)", title = "Titanic Survival Rates by Age")

Output:

ggplot(titanic, aes(x = Survived, y = Age)) +
theme_bw() +
geom_boxplot() +
labs(y = "Age",x = "Survived",title = "Titanic Survival Rates by Age")

We have also used a box plot for the same question. A box plot shows the median and quartiles of the data, along with possible outliers.

Output:
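
geom_boxplot() draws the median but not the mean. If you want both, one possible approach is to add the mean with stat_summary(). The sketch below uses a made-up data frame toy that only mimics the Survived and Age columns.

```r
library(ggplot2)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  Survived = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  Age      = c(20, 30, 40, 70, 5, 25, 30, 35)
)

# Median age per group: the line inside each box
tapply(toy$Age, toy$Survived, median)

# Mean age per group: not drawn by geom_boxplot()
tapply(toy$Age, toy$Survived, mean)

# stat_summary() adds the mean as a cross on top of each box
p_box <- ggplot(toy, aes(x = Survived, y = Age)) +
  theme_bw() +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3)

print(p_box)
```

When the mean sits far from the median, as in the first toy group, it is a hint that the distribution is skewed.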

7. What are the survival rates by age when segmented by gender and class of ticket?

#Density plot
ggplot(titanic, aes(x = Age, fill = Survived)) +
theme_bw() +
facet_wrap(Sex ~ Pclass) +
geom_density(alpha = 0.5) +
labs(y = "Density", x = "Age", title = "Titanic Survival Rates by Age, Pclass and Sex")

Here, the geom_density() function is used to plot a density chart, which is like a smoothed histogram. With alpha = 0.5 the filled curves are semi-transparent, so you can see where the distributions of the two groups overlap and cross.

Output:

#histogram
ggplot(titanic, aes(x = Age, fill = Survived)) +
theme_bw() +
facet_wrap(Sex ~ Pclass) +
geom_histogram(binwidth = 5) +
labs(y = "Passenger Count",
x = "Age (binwidth = 5)",
title = "Titanic Survival Rates by Age, Pclass and Sex")

Output:

Summary

In this blog we have learned how to install the ggplot2 package and import the data. We have also learned how to visualize the imported data with different charts: histogram, density chart, bar graph and box plot. Thank you.😁


Hope to see you next time. I'll be posting more basics of R programming in the coming days.
