
Speeding Through Classification Problems and Evaluating Outcomes


Learn how to use big data sets and the R language to create and work with classification and prediction models.


Classification is about simplification so as to aid cognition. For example, say you go to a store and decide broadly on the categories of goods you would like to purchase: dairy, veggies, wheat/rice, and so on. What do I need in each category?

Similarly, when you make a decision about a purchase, it also tends to be binary: Should I add to the shopping cart or not? Should I pay for it, or ask for it to be set aside when checking out?

Banks tend to look at their retail customers in the same way. Should we allow them to open an account or not? Should we sanction them a loan or not? Should we grant them unsecured credit or not?

Credit evaluation is thus a comprehensive analysis of tens or hundreds of potential factors that feed into the decision. The decision itself, however, is mostly either a YES or a NO.

Data Columns


We will use sample credit evaluation data (source: https://newonlinecourses.science.psu.edu/stat508/resource/analysis/gcd) for this illustration.

Our first variable is Creditability, which takes a value of either 1 or 0.

The remaining factors can be thought of as falling into two silos:

  1. Demographic: such as Age, Number of Dependents, Occupation, Gender, Nationality/work status. 

  2. Behavioral: Time lived at current address, Apartment type, Marital status, Length of current employment, Payment status of previous credit, and more.

Obviously, I have to choose a model and train it on this data before I can test it with a few examples. At the same time, the data on which the model is trained must not be used to test it.

After all, we do not want the model to parrot back the sentences spoken to it. We need the model to demonstrate that it has really learned something by evaluating data it has not seen before.

For this purpose, we can derive partitions to generate data for training and testing, respectively.

# Sample half of the row positions without replacement, so that no row is drawn twice
trainingPointer<-sample(seq_len(nrow(data)),floor(0.5*nrow(data)),replace=FALSE)

train<-data[trainingPointer,]
test<-data[-trainingPointer,]

The code above splits the given data set into two halves: one for training and the other for testing.
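As a quick sanity check on this kind of split, here is a minimal sketch on a small made-up data frame. The name toy and its columns are hypothetical and stand in for the credit data. It samples row positions without replacement, so no row can land in both partitions.

```r
# Toy data frame standing in for the credit data (hypothetical columns).
set.seed(123)  # fixed seed so the split is reproducible
toy <- data.frame(id = 1:10, Creditability = rep(c(1, 0), times = 5))

# Sample half of the row positions without replacement.
trainIdx <- sample(seq_len(nrow(toy)), size = floor(0.5 * nrow(toy)))

toyTrain <- toy[trainIdx, ]
toyTest  <- toy[-trainIdx, ]

# The two partitions are disjoint and together cover every row.
nrow(toyTrain)
nrow(toyTest)
length(intersect(toyTrain$id, toyTest$id))
```

If the intersection were non-empty, some rows would be used for both training and testing, which is exactly what the split is meant to avoid.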

We will start with a simple logistic regression model.

model<-glm(Creditability ~ Account.Balance + Payment.Status.of.Previous.Credit + Purpose +
             Value.Savings.Stocks + Length.of.current.employment + Sex...Marital.Status +
             Most.valuable.available.asset + Type.of.apartment + Concurrent.Credits +
             Duration.of.Credit..month. + Credit.Amount + Age..years.,
           family=binomial, data = train)


Example Data


Calling summary(model) prints the fitted coefficients, and R marks the significant factors (p-value < 0.05) with one or more * symbols.

So, begin by looking for those factors which have been assigned at least one *.

You can see that the only demographic factor which is significant for this specific data set is Gender/Marital Status. The rest are behavioral.
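Instead of reading the stars by eye, the significant terms can also be pulled out of the coefficient table programmatically. The sketch below is only an illustration on synthetic data. The names toyModel, x1, and x2 are made up, and the 0.05 cutoff mirrors the one used above.

```r
# Synthetic data, for illustration only: x1 drives the outcome, x2 is noise.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * x1))
toyModel <- glm(y ~ x1 + x2, family = binomial)

# coef(summary(...)) returns a matrix with one row per model term.
coefTable <- coef(summary(toyModel))

# Keep the terms whose p-value is below 0.05, dropping the intercept.
pValues <- coefTable[, "Pr(>|z|)"]
significant <- setdiff(names(pValues)[pValues < 0.05], "(Intercept)")
print(significant)
```

The same two lines that build pValues and significant should work on any binomial glm fit, since the p-value column carries the same name.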

Now that our logistic regression model has learned something, let's evaluate it. Is there a way to have our model work on new examples?

(For instance, the test data partition that we created moments ago?)

Sure, there is a function for that. 

predictglm<-predict(model, newdata = test, type = "response")

Note that we have chosen the test data set and not the training dataset.

> head(predictglm)
        1         2         4         5         6         7 
0.5956226 0.7667928 0.7486736 0.5850678 0.7608557 0.7726266 
> 

The model has tried to predict Creditability, but it has done so on a continuous numerical scale. That is because the output is not the predicted class itself but the probability of a given value, in this case 1.

This is because we specified type = "response". If we had not specified it, we would have gotten the natural logarithm of the odds (that is, the log-odds of the decision being 1 rather than 0).

predictglmWithoutResponseSpecified<-predict(model, newdata = test)
> head(predictglmWithoutResponseSpecified)
        1         2         4         5         6         7 
0.3872586 1.1902895 1.0915506 0.3436126 1.1573765 1.2232024 

You can quickly verify the relation between the two types of output by working out a simple algebraic equation.

exp(head(predictglmWithoutResponseSpecified))/(1+exp(head(predictglmWithoutResponseSpecified)))

So, we are inverting the log of odds to get back the predicted probabilities.
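The conversion can be checked directly against the numbers printed above. This small sketch copies the six log-odds and the six probabilities from the two head() outputs and compares the hand-written formula with base R's plogis(), which computes the same inverse logit.

```r
# Log-odds and probabilities copied from the head() outputs above.
logOdds <- c(0.3872586, 1.1902895, 1.0915506, 0.3436126, 1.1573765, 1.2232024)
probs   <- c(0.5956226, 0.7667928, 0.7486736, 0.5850678, 0.7608557, 0.7726266)

# plogis() is the inverse logit: exp(x) / (1 + exp(x)).
fromFormula <- exp(logOdds) / (1 + exp(logOdds))
fromPlogis  <- plogis(logOdds)

# Both routes agree with each other and, up to rounding, with the printed probabilities.
all.equal(fromFormula, fromPlogis)
max(abs(fromPlogis - probs))

# qlogis() goes the other way, from probability back to log-odds.
max(abs(qlogis(probs) - logOdds))
```

The small remaining differences come only from the seven-digit rounding of the printed values.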

However, irrespective of the mode, what we need is a binary response, and that can be achieved through a custom function.

For instance, I can define a threshold of 0.6 on the predicted probability, above which I can safely conclude that the model's predicted response is 1 rather than 0.

convertRes<-function (x)
{
  if(as.numeric(x)>0.6) {
    return(1)
  } else {
    return(0)
  }
}

Now, you can use lapply to apply this custom function to every predicted value.

predictglm<-unlist(lapply(predictglm,convertRes))
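As an alternative sketch, the same thresholding can be written without a helper function, because comparisons in R are vectorized. The probabilities below are copied from the head(predictglm) output shown earlier. The names probs, threshold, and predictedClass are introduced here only for illustration.

```r
# Predicted probabilities copied from the head(predictglm) output earlier.
probs <- c(`1` = 0.5956226, `2` = 0.7667928, `4` = 0.7486736,
           `5` = 0.5850678, `6` = 0.7608557, `7` = 0.7726266)

threshold <- 0.6

# The comparison is vectorized, so no explicit loop or lapply is needed.
predictedClass <- ifelse(probs > threshold, 1, 0)
print(predictedClass)
```

Keeping the threshold in a variable makes it easy to try other cutoffs and see how the predicted classes change.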

You would need a bit of data re-arrangement to get something like this (hint: convert to data.frame and add as columns):

   Creditability glm
1             1   0
2             1   1
4             1   1
5             1   0
6             1   1
7             1   1
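One way to follow the hint is sketched below. To stay self-contained it types in the six rows shown above by hand rather than reading them from test and predictglm. The names results, confusion, and accuracy are illustrative.

```r
# The six rows shown above, typed in by hand for illustration.
results <- data.frame(
  Creditability = c(1, 1, 1, 1, 1, 1),
  glm           = c(0, 1, 1, 0, 1, 1),
  row.names     = c("1", "2", "4", "5", "6", "7")
)

# Cross-tabulate actual against predicted classes (a confusion matrix).
confusion <- table(actual = results$Creditability, predicted = results$glm)
print(confusion)

# Share of rows where the prediction matches the actual value.
accuracy <- mean(results$Creditability == results$glm)
print(accuracy)
```

With the real data, the two columns would presumably come from test$Creditability and the thresholded predictions instead of hand-typed vectors.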

This will allow us to use the ROCR package (an add-on package available from CRAN, not part of base R) to evaluate the model's classification performance.

For instance, we have used ROCR to plot True Positive Rate Versus False Positive Rate, to understand the trade-offs faced by the model with respect to learning classification.
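To make the ROCR step concrete, here is a minimal sketch on synthetic scores and labels, made up for illustration and not taken from the credit data. It assumes ROCR is installed and skips the plotting step otherwise. The sketch hands ROCR predicted probabilities rather than thresholded 0/1 values, since a ROC curve is traced by sweeping the threshold.

```r
# Synthetic labels and scores, for illustration only (not the credit data).
set.seed(42)
labels <- rbinom(200, size = 1, prob = 0.7)
# Positive cases get higher scores on average, so the curve lies above the diagonal.
scores <- plogis(rnorm(200, mean = ifelse(labels == 1, 1, 0)))

if (requireNamespace("ROCR", quietly = TRUE)) {
  library(ROCR)

  # Pair each score with its true label.
  pred <- prediction(scores, labels)

  # True positive rate on the y-axis against false positive rate on the x-axis.
  perf <- performance(pred, measure = "tpr", x.measure = "fpr")
  plot(perf, main = "True Positive Rate Versus False Positive Rate")
  abline(a = 0, b = 1, lty = 2)  # reference line for a random classifier

  # Area under the curve as a single summary number.
  auc <- performance(pred, measure = "auc")@y.values[[1]]
  print(auc)
}
```

With the real model, scores would be replaced by the probabilities from predict(..., type = "response") and labels by the Creditability column of the test partition.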

True Positive Rate Versus False Positive Rate


That's it. It is so easy to classify and predict using R.
