Classification in R – asmaliza.com

Disclaimer: This example is taken from the book ‘R in a Nutshell’ by Joseph A. Adler.

Classification is another type of supervised learning in machine learning (ML). It is the process of assigning each item in a dataset to one of a set of classes. The classes are also known as targets, labels, or categories (hence “categorization”).

Examples of classification problems include predicting whether a particular patient is diabetic based on their medical records, or deciding whether an incoming email is spam based on a list of filtered keywords.

Today, we are going to look at an example of a classification problem taken from the reference book. In this example, we will apply a logistic regression model.

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression is estimating the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail which is represented by an indicator label, where the two values are labeled “0” and “1”.
– Wikipedia
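As a minimal sketch of the logistic function mentioned in the quote, the snippet below maps a linear predictor to a probability between 0 and 1. The helper name `logistic` and the values of `eta` are arbitrary illustrations and are not taken from the book or the dataset; `plogis()` is the equivalent function in base R's stats package.

```r
# Logistic (inverse-logit) function: p = 1 / (1 + exp(-eta))
logistic <- function(eta) 1 / (1 + exp(-eta))

eta <- c(-4, -1, 0, 1, 4)   # arbitrary linear-predictor values
p <- logistic(eta)

# plogis() computes the same quantity
all.equal(p, plogis(eta))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, and larger values push the probability toward 1.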

The sample dataset that we will use is a field goal dataset. Each time a kicker attempts a field goal (i.e., a kick), there is a chance that the attempt will succeed and a chance that it will fail. The probability varies with distance: attempts closer to the goal posts are more likely to succeed.

As usual before we start, load the lattice package.

library(lattice)

Next, load the dataset.

load("field.goals.rda")

Check the dimension of the dataset.

> dim(field.goals)
[1] 982  10

Call head() on the dataset to get an idea of what the data looks like.

> head(field.goals)
  home.team week qtr away.team offense defense  play.type        player yards stadium.type
1       ARI   14   3       WAS     ARI     WAS    FG good   1-N.Rackers    20          Out
2       ARI    2   4       STL     ARI     STL    FG good   1-N.Rackers    35          Out
3       ARI    7   3       TEN     ARI     TEN    FG good   1-N.Rackers    24          Out
4       ARI   12   2       JAC     JAC     ARI    FG good   10-J.Scobee    30          Out
5       ARI    2   3       STL     ARI     STL    FG good   1-N.Rackers    48          Out
6       ARI    7   4       TEN     TEN     ARI FG aborted 15-C.Hentrich    33          Out

Let’s transform the dataset and create a new binary variable for field goals that are either good or bad.

field.goals.forlr <- transform(field.goals, good=as.factor(ifelse(play.type=="FG good","good","bad")))

View the transformed dataset. A new column “good” has been added to the dataset.

> head(field.goals.forlr)
  home.team week qtr away.team offense defense  play.type        player yards stadium.type good
1       ARI   14   3       WAS     ARI     WAS    FG good   1-N.Rackers    20          Out good
2       ARI    2   4       STL     ARI     STL    FG good   1-N.Rackers    35          Out good
3       ARI    7   3       TEN     ARI     TEN    FG good   1-N.Rackers    24          Out good
4       ARI   12   2       JAC     JAC     ARI    FG good   10-J.Scobee    30          Out good
5       ARI    2   3       STL     ARI     STL    FG good   1-N.Rackers    48          Out good
6       ARI    7   4       TEN     TEN     ARI FG aborted 15-C.Hentrich    33          Out  bad
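One detail worth checking after this transformation is the order of the factor levels, because glm() with a binomial family treats the first level of a factor response as failure and every other level as success. The sketch below uses a small made-up vector of play types (an assumption for illustration, not rows from field.goals) to show that as.factor() sorts the levels alphabetically, so “bad” comes first and the model will estimate the probability of “good”.

```r
# Toy play types (made up for illustration, not from field.goals)
play.type <- c("FG good", "FG no", "FG good", "FG aborted", "FG blocked")

# Same recoding as in the transform() call above
good <- as.factor(ifelse(play.type == "FG good", "good", "bad"))

levels(good)   # levels are sorted alphabetically, so "bad" comes first
table(good)    # number of observations in each class
```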

We are going to plot the dataset to get a better understanding of the relationship between field goal success and distance (column: yards). Before that, let’s tabulate the dataset.

> field.goals.table <- table(field.goals.forlr$good, field.goals.forlr$yards)
> field.goals.table
      
       18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
  bad   0  0  1  1  1  1  0  0  0  3  5  5  2  6  7  5  3  0  4  3 11  6  7  5  6 11  5  9 12 11 10  9  5  8 11 10  3  1  2  1  1  1  1  1  1
  good  1 12 24 28 24 29 30 18 27 22 26 32 22 21 30 31 21 25 20 23 29 35 27 32 21 15 24 16 15 26 18 14 11  9 12 10  2  1  3  0  1  0  0  0  0

The top row is the distance in yards, followed by the number of bad and good field goals at each distance, respectively.

Now, we plot the results.

plot(colnames(field.goals.table), field.goals.table["good",]/ (field.goals.table["bad",] +field.goals.table["good",]), xlab="Distance (Yards)", ylab="Percent Good")

We can see from the plot that there is a roughly linear downward trend between distance and the proportion of successful field goals, with a few outlying points at the bottom right; from 57 yards onward, each distance has only one or two attempts. As the distance increases, the success rate decreases. In other words, the closer the kicker is to the goal posts, the higher the chance of a successful field goal.
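The per-distance success rate plotted here can also be computed with prop.table() instead of dividing the table rows by hand. The following is a minimal sketch on a made-up sample of eight kicks (the vectors `good` and `yards` are assumptions for illustration, not the real data):

```r
# Toy sample (made up): outcome and distance for eight kicks
good  <- factor(c("good", "good", "bad", "good", "bad", "bad", "good", "good"))
yards <- c(20, 20, 20, 35, 35, 50, 50, 50)

counts <- table(good, yards)

# margin = 2 divides each column by its column total,
# giving the share of bad and good kicks at each distance
rates <- prop.table(counts, margin = 2)
rates["good", ]
```

Each column of `rates` sums to 1, so the “good” row is the proportion of successful kicks at that distance.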

To model the probability of a successful field goal using logistic regression, call the glm() function.

> field.goals.mdl <- glm(formula=good~yards, data=field.goals.forlr, family="binomial")
> field.goals.mdl

Call:  glm(formula = good ~ yards, family = "binomial", data = field.goals.forlr)

Coefficients:
(Intercept)        yards  
    5.17886     -0.09726  

Degrees of Freedom: 981 Total (i.e. Null);  980 Residual
Null Deviance:	    978.9 
Residual Deviance: 861.2 	AIC: 865.2

Then call the summary() function to get more detailed results. The coefficient for yards is negative, which agrees with our earlier hypothesis: the closer the kicker is to the goal posts, the higher the chance of a successful field goal.

> summary(field.goals.mdl)

Call:
glm(formula = good ~ yards, family = "binomial", data = field.goals.forlr)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5582   0.2916   0.4664   0.6978   1.3789  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  5.178856   0.416201  12.443   <2e-16 ***
yards       -0.097261   0.009892  -9.832   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 978.90  on 981  degrees of freedom
Residual deviance: 861.22  on 980  degrees of freedom
AIC: 865.22

Number of Fisher Scoring iterations: 5
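To make the coefficients easier to read, the sketch below converts them into an odds ratio, the distance at which the fitted success probability is 50%, and fitted probabilities at a few distances. The coefficient values are copied from the summary output above; the variable names and the three example distances (20, 40 and 60 yards) are arbitrary choices for illustration.

```r
# Coefficients copied from the summary output above
b0 <- 5.178856    # intercept
b1 <- -0.097261   # change in log-odds per additional yard

# Odds ratio per extra yard: each yard multiplies the odds of success by this
odds.ratio <- exp(b1)

# Distance at which the fitted probability is 0.5 (where b0 + b1 * yards = 0)
yards.50 <- -b0 / b1

# Fitted probabilities at a few example distances
p <- plogis(b0 + b1 * c(20, 40, 60))

round(c(odds.ratio = odds.ratio, yards.50 = yards.50), 3)
round(p, 3)
```

Under this model, each additional yard multiplies the odds of success by roughly 0.91, the fitted probability drops to one half at about 53 yards, and the fitted probabilities at 20, 40 and 60 yards are roughly 0.96, 0.78 and 0.34.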

Finally, plot the results to see how well this model fits the data, and draw a line showing the fitted trend.

> plot(colnames(field.goals.table), field.goals.table["good",]/ (field.goals.table["bad",] + field.goals.table["good",]), xlab="Distance (Yards)", ylab="Percent Good")
> # add a line to chart
> fg.prob <- function(y) { 
+     eta <- 5.178856 + -0.097261 * y
+     1 / (1 + exp(-eta))
+ }
> lines(15:65, fg.prob(15:65))

As expected from the statistics above, the model looks like it fits the data reasonably well.
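An alternative to typing the coefficients into a helper function is to let predict() compute the fitted probabilities from the model object. Since the real dataset is not bundled here, the sketch below fits the same kind of model to simulated data; the data frame `sim`, the seed, and the coefficients used to generate it are all assumptions for illustration. With the real data you would pass field.goals.mdl to predict() instead.

```r
set.seed(1)  # reproducible simulated data

# Simulated stand-in for field.goals.forlr (NOT the real dataset):
# distances between 18 and 62 yards, outcomes drawn from a logistic model
n     <- 982
yards <- sample(18:62, n, replace = TRUE)
good  <- factor(ifelse(runif(n) < plogis(5.18 - 0.097 * yards), "good", "bad"))
sim   <- data.frame(yards = yards, good = good)

mdl <- glm(good ~ yards, data = sim, family = "binomial")

# predict() with type = "response" returns fitted probabilities,
# so the coefficients never need to be typed in by hand
grid <- data.frame(yards = 15:65)
prob <- predict(mdl, newdata = grid, type = "response")

# Observed success rate per distance, with the fitted curve on top
obs <- tapply(sim$good == "good", sim$yards, mean)
plot(as.numeric(names(obs)), obs,
     xlab = "Distance (Yards)", ylab = "Proportion Good")
lines(grid$yards, prob)
```

The design choice here is simply to keep the curve tied to the fitted model: if the model is refitted, the line updates without any manual edits.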
