Prediction Models in R Programming

Updated: Apr 17, 2020

Predictive models are very useful for forecasting future outcomes and estimating metrics that are impractical to measure. For example, a data scientist could use a predictive model to forecast crop yields based on rainfall and temperature.

So let's start by understanding the typical data science workflow for a predictive model:

1. Collect relevant data for the problem.
2. Clean the data into a convenient form.
3. Conduct an exploratory data analysis to get a better understanding of the data.
4. Construct a model of some aspect of the data.
5. Use the model to answer the problem, which most of the time means making predictions.

Hello people, this is Zeeshan Jagirdar, software engineer. In this blog we are going to learn the following concepts:

1. What is regression and why we use it.
2. Types of regression.
3. Why linear regression?
4. Working on datasets.
5. Assumptions for linear regression.
6. Analysis for linear regression.
7. Visualizing the result.
8. Predicting with the model.

Before getting started with this article, it is recommended to read the following articles for the basics of R programming (more articles on the basics are on the way):

1. Basics of R programming
2. Data Cleaning in R

There is a list of datasets, with download links, at the bottom of this blog; if you want to practice on a real dataset, be sure to check out the list.


So, after reading the recommended blogs above, let's start by understanding what regression is.

1. What is Regression?

Regression is a statistical analysis that attempts to predict the effect of one or more variables on another variable. Regression analysis is often used in the business and investment world to predict the effect of certain inputs on an output. The variable being influenced is called the dependent variable; the other variables are called independent variables. Regression is also used to determine covariance and correlation, which are important pieces of information for investors who want to diversify with stocks that are not correlated to the ones they already own.

2. Types of Regression.

The two basic types of regression that we are going to learn in this article are simple linear regression and multiple linear regression. There are also many more complicated regression types for complex data analysis. Simple linear regression uses one independent variable X to predict the outcome of a dependent variable Y, while multiple linear regression uses more than one independent variable to predict the outcome.

There are many more types of regressions like :

1. Logistic Regression.

2. Polynomial Regression.

3. Stepwise Regression.

4. ElasticNet Regression.


3. Why Linear Regression?

Linear regression is one of the simplest machine learning algorithms used by data scientists for predictive modeling. In this article, we will use linear regression to build a model that predicts happiness (the dependent variable) based on income (the independent variable).

The general formulas for both regression types are as follows:

Simple linear regression: Y = a + bX + u

Multiple linear regression: Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + u


Where

Y: The dependent variable that we are trying to predict.

X: The independent variable that is used to predict Y.

a: The intercept.

b: The slope.

u: The regression residual.

In simple linear regression there is one independent variable (X); if there is more than one independent variable (X1, X2, X3, ...), it is called multiple linear regression.
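
To make the formula concrete, here is a minimal sketch with made-up coefficients (a = 0.2 and b = 0.7 are illustrative assumptions, not estimates from any dataset):

a <- 0.2          # assumed intercept (illustration only)
b <- 0.7          # assumed slope (illustration only)
X <- 5            # an income of 5 on this dataset's scale (x $10,000)
Y <- a + b * X    # predicted happiness, ignoring the residual u
Y                 # 3.7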

4. Working on Datasets

In this article, we will work through two examples, each with its own dataset. So, let's start without wasting any time.


The first example is to predict happiness based on the income of an employee. We have a dataset called "income.data", in CSV format, downloaded from Kaggle. The first step in using linear regression is to determine whether the data fulfills the assumptions needed for an accurate predictive model. Before that, we first check whether the numerical values in the data are normally distributed using summary(), which reports each column's mean, median, max, min, and so on.


Step 1: Import libraries and the data (income.data)

library(ggplot2)    # for visualization
library(dplyr)      # for data manipulation
library(broom)      # tidies messy model output
library(ggpubr)     # adds easy-to-use helpers on top of ggplot2
setwd("G:/r/")      # set the working directory
my_data <- read.csv("income.data.csv")    # read the CSV data
View(my_data)       # view the data

Output:

Let's check whether the data has been read properly using summary():

summary(my_data)    # display the mean, median, min, and max of each column

Output:

5. Assumptions for linear regression.

Step 2: Checking if data meets the assumptions for linear regression.

There are four assumptions that need to be fulfilled by the data in order to perform linear regression.


1. Independence of observations.

Since our data has only one independent variable and one dependent variable, we do not need to test for hidden relationships among predictors. If the data had multiple independent variables, we would check for correlations between them with the cor() function, as in the sketch below.
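
Because income.data has only a single predictor, here is a minimal sketch of that check using R's built-in mtcars dataset instead, purely to show the pattern:

cor(mtcars$hp, mtcars$wt)    # correlation between two candidate predictors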


2. Normality

To check whether our dependent variable, happiness, follows a normal distribution, we will use hist() to draw a histogram and inspect its shape.

hist(my_data$happiness)    # plot a histogram of the happiness column

Output:

The distribution is roughly bell-shaped: more observations in the middle and fewer in the tails. Since it is approximately normal, we can proceed with linear regression.
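
If you want a numeric check to accompany the histogram, one option (not part of the original walkthrough) is a Shapiro-Wilk test, where a p-value above 0.05 is consistent with normality:

shapiro.test(my_data$happiness)    # p > 0.05: no evidence against normality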


3. Linearity

The relationship between the independent variable (income) and the dependent variable (happiness) must be linear. We can check linearity with a scatter plot: can the distribution of data points be described with a straight line?

plot(happiness ~ income, data = my_data)    # scatter plot with income on x and happiness on y

Output:

The relationship looks roughly linear, so we can proceed with the linear model.


6. Analysis for linear Regression.

Step 3: Linear Regression analysis

Let's see if there is a linear relationship between income and happiness in our survey of 500 people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1 to 10.

# lm() fits the linear model and returns the fitted regression object
income.happiness.lm <- lm(happiness ~ income, data = my_data)
summary(income.happiness.lm)    # display the results

Output:

Here, the first line of the output is the model equation used for the regression. The residuals are the unexplained variance; they are not exactly the same as model error, but they are calculated from it, so a bias in the residuals would also indicate a bias in the error. In the diagnostic plots shown later, the most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered on zero. This means there are no outliers or biases in the data that would make a linear regression invalid.
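
If you prefer to read the intercept (a) and slope (b) programmatically instead of off the printed summary, here is a short sketch using the fitted object from above:

coef(income.happiness.lm)       # intercept (a) and slope (b)
confint(income.happiness.lm)    # 95% confidence intervals for both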


Now let's check the fourth assumption for linear regression.

4. Homoscedasticity

This is one more assumption that needs to hold for linear regression. Before visualizing the model, we need to check that the minor error visible in the residuals in the summary above does not change significantly across the values of the independent variable (income).

par(mfrow = c(2, 2))    # arrange the four diagnostic plots in one frame
plot(income.happiness.lm)    # plot the diagnostic graphs for the fitted model

Output:

7. Visualizing the result.

Step 4: Visualizing the results with graphs.

Following are the steps to visualize a simple linear regression.


1. Plot the data points on the graph.

my_graph <- ggplot(my_data, aes(x = income, y = happiness)) + geom_point()
my_graph

Output:


2. Add the linear regression line to the plotted data

my_graph <- my_graph + geom_smooth(method="lm", col="black")
my_graph

Output:

3. Add the regression equation to the graph

my_graph <- my_graph + 
  stat_regline_equation(label.x = 3, label.y = 7)
my_graph

Output:


4. Polish the graph for the final result.

my_graph +
  theme_bw() +
  labs(title = "Reported happiness as a function of income",
       x = "Income (x$10,000)",
       y = "Happiness score (0 to 10)")

Output:

8. Predicting with the model.

Step 5: Predicting with the linear regression model

Now that we have built the linear regression model on our income.data dataset, let's split the data in an 80:20 ratio before making predictions, so that we can evaluate the model on the held-out 20% (the test data) while training it on the other 80% (the training data). This is standard practice in data analysis.


1. Creating training (development) and test (validation) data samples from original data.

# Create Training and Test data -
set.seed(100)  # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(my_data), 0.8*nrow(my_data))  # row indices for training data
trainingData <- my_data[trainingRowIndex, ]  # training data
testData  <- my_data[-trainingRowIndex, ]   # test data

2. Develop the model on the training data and use it to predict happiness on the test data

lmMod <- lm(happiness ~ income, data=trainingData)  # build the model on the training data
happinessPred <- predict(lmMod, testData)  # predict happiness on the test data

3. Check the new model with summary() and compare it with the model built on the full dataset. If there are only minor differences between values such as the p-values, the model trained on the training data can also be considered statistically significant.

summary(lmMod)

Output:

Comparing the training model with the full-data model in step 3, we can see that the p-values of both are below the significance level, so the training model is statistically significant.
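
For a more direct comparison than eyeballing the printed summaries, here is a sketch that extracts the slope's p-value from each model (object names as built above):

summary(lmMod)$coefficients["income", "Pr(>|t|)"]                  # training model
summary(income.happiness.lm)$coefficients["income", "Pr(>|t|)"]    # full-data model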


4. Calculate prediction accuracy and error rates.

actuals_preds <- data.frame(cbind(actuals=testData$happiness, predicteds=happinessPred))  # data frame of actual vs predicted happiness
correlation_accuracy <- cor(actuals_preds)  # 82.7%
head(actuals_preds)

Output:

Now, let's calculate the min-max accuracy and the mean absolute percentage error (MAPE) using their general formulas.
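
The code for these two metrics is not shown above; a common way to compute them from the actuals_preds data frame built in the previous step is the following sketch, whose results should match the outputs below:

min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))         # row-wise min/max ratio, averaged
mape <- mean(abs(actuals_preds$predicteds - actuals_preds$actuals) / actuals_preds$actuals)    # mean absolute percentage error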

Output: 76% min-max accuracy

Output: 23% mean absolute percentage error


Step 6: Visualizing the predicted linear regression model against the original model to see the difference in error.

library(DAAG)    # data analysis and graphics functions
cvResults <- suppressWarnings(CVlm(my_data, form.lm = happiness ~ income,
              m = 5, dots = FALSE, seed = 29, legend.pos = "topleft",
              printit = FALSE, main = "Small symbols are predicted values while bigger ones are actuals."))

Output:

By doing this, we are checking two things:

1. If the model’s prediction accuracy isn’t varying too much for any one particular sample, and

2. If the lines of best fit don't vary too much with respect to the slope and level.
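
As a follow-up, the object returned by CVlm() also carries the overall cross-validation mean squared error as an attribute, which gives a single number for comparing models (assuming the cvResults object from above):

attr(cvResults, "ms")    # mean squared error across the 5 folds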



List of datasets for mini projects in R:

You can download the datasets from the topics below:


I hope you can now build a predictive model using simple linear regression. I'll be explaining different regression models, along with other machine learning algorithms, in my coming blogs. Thank you.

If you have any difficulty downloading the datasets, or with any operation in this article or any other R programming article, please feel free to comment below or contact me. I'll be happy to help.

