Disclaimer: This example is taken from the book R in a Nutshell by Joseph A. Adler.
In machine learning (ML), regression is a supervised learning technique used to model the relationship between one or more predictor variables (features) and a continuous output (label/response), and to predict that output for new observations. It is mainly used for prediction, forecasting, time-series modeling, and exploring cause-and-effect relationships between variables.
One of the simplest regression analysis techniques is linear regression.
A linear regression assumes that there is a linear relationship between the response (label) variable and the predictor (feature) variables.
– R In A Nutshell, Joseph A. Adler
To study how linear regression works, let’s look at an example taken from the reference book. In this example, we are going to analyze a sample dataset of batting statistics for baseball teams.
First, load the lattice package (it provides the make.groups and xyplot functions we will use below).
library(lattice)
Next, load the dataset.
load("team.batting.00to08.rda")
I like to check the dimensions of a dataset before I proceed, so that I have an idea of how big the data I am going to work with is.
> dim(team.batting.00to08)
[1] 270 13
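If you also want a quick look at what type each column is, str gives a compact overview (plain base R, nothing specific to this example):
# Optional: compact overview of the column names and types
str(team.batting.00to08)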
Then, have a peek at the first few rows of the dataset.
> head(team.batting.00to08)
  teamID yearID runs singles doubles triples homeruns walks stolenbases
1    ANA   2000  864     995     309      34      236   608          93
2    BAL   2000  794     992     310      22      184   558         126
3    BOS   2000  792     988     316      32      167   611          43
4    CHA   2000  978    1041     325      33      216   591         119
5    CLE   2000  950    1078     310      30      221   685         113
6    DET   2000  823    1028     307      41      177   562          83
  caughtstealing hitbypitch sacrificeflies atbats
1             52         47             43   5628
2             65         49             54   5549
3             30         42             48   5630
4             42         53             61   5646
5             34         51             52   5683
6             38         43             49   5644
Ok, now we are ready to begin our analysis. We are going to see how different features (metrics) such as singles, doubles, triples, homeruns, walks, stolenbases, caughtstealing, hitbypitch, sacrificeflies and atbats are related to the number of runs for each baseball team.
Let’s start with this formula:
runs ~ singles + doubles + triples + homeruns + walks + stolenbases +
       caughtstealing + hitbypitch + sacrificeflies + atbats
A linear regression model takes the form:
y = I + C1x1 + C2x2 + C3x3 + ...
where I is the intercept and C1, C2, C3, ... are the coefficients, while y is the response and x1, x2, x3, ... are the predictors.
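To make that concrete, here is a tiny self-contained sketch (my own toy data, not the baseball set): we generate data from a known linear relationship and let lm recover the intercept and coefficients.
# Toy example: simulate y = 2 + 3*x1 - 1.5*x2 + noise, then recover the
# intercept and coefficients with lm
set.seed(1)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(100, sd = 0.5)
coef(lm(y ~ x1 + x2))   # estimates should come out close to 2, 3 and -1.5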
Before we fit the model, we are going to reshape the data into a single long-format data frame, so that we can plot each metric against the number of runs.
# Stack one copy of (value, teamID, yearID, runs) per metric; make.groups (from
# lattice) adds a "which" column recording which metric each row came from
attach(team.batting.00to08)
forplot <- make.groups(
  singles        = data.frame(value = singles,        teamID, yearID, runs),
  doubles        = data.frame(value = doubles,        teamID, yearID, runs),
  triples        = data.frame(value = triples,        teamID, yearID, runs),
  homeruns       = data.frame(value = homeruns,       teamID, yearID, runs),
  walks          = data.frame(value = walks,          teamID, yearID, runs),
  stolenbases    = data.frame(value = stolenbases,    teamID, yearID, runs),
  caughtstealing = data.frame(value = caughtstealing, teamID, yearID, runs),
  hitbypitch     = data.frame(value = hitbypitch,     teamID, yearID, runs),
  sacrificeflies = data.frame(value = sacrificeflies, teamID, yearID, runs)
)
detach(team.batting.00to08)
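(A side note of my own, not from the book: attach/detach works fine here, but it can surprise you if column names clash with objects already in your workspace. The same forplot data frame can be built without attaching, for example like this.)
# Alternative sketch: build forplot without attach/detach
predictors <- c("singles", "doubles", "triples", "homeruns", "walks",
                "stolenbases", "caughtstealing", "hitbypitch", "sacrificeflies")
forplot <- do.call(make.groups,
  lapply(setNames(predictors, predictors), function(p)
    data.frame(value = team.batting.00to08[[p]],
               team.batting.00to08[c("teamID", "yearID", "runs")])))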
Then, plot runs against each metric so that we can observe how they are related.
xyplot(runs ~ value | which, data = forplot,
       scales = list(relation = "free"), pch = 19, cex = .2,
       strip = strip.custom(strip.levels = TRUE,
                            horizontal = TRUE,
                            par.strip.text = list(cex = .8)))

Based on the plot, we can see that, as expected, teams that hit a lot of homeruns definitely score more runs. Teams that draw a lot of walks are also able to score more runs. Interesting.
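If you want to back up that visual impression with a number, a quick (and admittedly rough) check of my own is the correlation of each metric with runs; correlation only captures linear association, but it points in the same direction as the plot:
# Rough univariate check: correlation of each batting metric with runs
metrics <- c("singles", "doubles", "triples", "homeruns", "walks", "stolenbases",
             "caughtstealing", "hitbypitch", "sacrificeflies", "atbats")
sort(sapply(team.batting.00to08[metrics], cor, y = team.batting.00to08$runs),
     decreasing = TRUE)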
Next, we fit a model so that we can check our hypotheses about the relationship between the features (singles, doubles, homeruns, etc.) and the number of runs scored by each team. For fitting, we use the lm function.
runs.mdl <- lm(formula = runs ~ singles + doubles + triples + homeruns + walks +
                 hitbypitch + sacrificeflies + stolenbases + caughtstealing,
               data = team.batting.00to08)
View the model returned.
> runs.mdl

Call:
lm(formula = runs ~ singles + doubles + triples + homeruns +
    walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
    data = team.batting.00to08)

Coefficients:
    (Intercept)         singles         doubles         triples        homeruns
     -507.16020         0.56705         0.69110         1.15836         1.47439
          walks      hitbypitch  sacrificeflies     stolenbases  caughtstealing
        0.30118         0.37750         0.87218         0.04369        -0.01533
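These coefficients plug straight into the equation above: predicted runs = -507.16020 + 0.56705*singles + 0.69110*doubles + ... and so on. As a quick sanity check (my own addition, not from the book), we can compare the model’s fitted values with the actual runs for the first few teams:
# Compare fitted (predicted) runs with actual runs for the first six teams
cbind(predicted = round(fitted(runs.mdl)),
      actual    = team.batting.00to08$runs)[1:6, ]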
Use the summary function to get more detail about the fitted linear model object.
> summary(runs.mdl)

Call:
lm(formula = runs ~ singles + doubles + triples + homeruns +
    walks + hitbypitch + sacrificeflies + stolenbases + caughtstealing,
    data = team.batting.00to08)

Residuals:
    Min      1Q  Median      3Q     Max
-71.902 -11.828  -0.419  14.658  61.874

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -507.16020   32.34834 -15.678  < 2e-16 ***
singles           0.56705    0.02601  21.801  < 2e-16 ***
doubles           0.69110    0.05922  11.670  < 2e-16 ***
triples           1.15836    0.17309   6.692 1.34e-10 ***
homeruns          1.47439    0.05081  29.015  < 2e-16 ***
walks             0.30118    0.02309  13.041  < 2e-16 ***
hitbypitch        0.37750    0.11006   3.430 0.000702 ***
sacrificeflies    0.87218    0.19179   4.548 8.33e-06 ***
stolenbases       0.04369    0.05951   0.734 0.463487
caughtstealing   -0.01533    0.15550  -0.099 0.921530
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.21 on 260 degrees of freedom
Multiple R-squared:  0.9144, Adjusted R-squared:  0.9114
F-statistic: 308.6 on 9 and 260 DF,  p-value: < 2.2e-16
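If you want to reuse any of these numbers rather than read them off the printout, the usual accessors work on the fitted model object; a short sketch:
coef(runs.mdl)                   # intercept and coefficients
confint(runs.mdl)                # 95% confidence intervals for the coefficients
summary(runs.mdl)$r.squared      # multiple R-squared (about 0.914 here)
summary(runs.mdl)$adj.r.squared  # adjusted R-squared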
Here is the ANOVA table for the model.
> anova(runs.mdl)
Analysis of Variance Table

Response: runs
                Df Sum Sq Mean Sq   F value    Pr(>F)
singles          1 215755  215755  400.4655 < 2.2e-16 ***
doubles          1 356588  356588  661.8680 < 2.2e-16 ***
triples          1    237     237    0.4403  0.507565
homeruns         1 790051  790051 1466.4256 < 2.2e-16 ***
walks            1 114377  114377  212.2971 < 2.2e-16 ***
hitbypitch       1   7396    7396   13.7286  0.000258 ***
sacrificeflies   1  11726   11726   21.7643 4.938e-06 ***
stolenbases      1    357     357    0.6632  0.416165
caughtstealing   1      5       5    0.0097  0.921530
Residuals      260 140078     539
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Based on these statistics, stolenbases and caughtstealing are not significant predictors of the number of runs scored by each team. Triples also looks unimportant in the sequential ANOVA table, even though its coefficient is highly significant in the summary above; the ANOVA only tells us it adds little once singles and doubles are already in the model.
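A natural follow-up (my own suggestion, not taken from the book) is to refit the model without stolenbases and caughtstealing and let anova compare the reduced fit against the full one:
# Drop the apparently insignificant predictors and compare the two fits
runs.mdl2 <- update(runs.mdl, . ~ . - stolenbases - caughtstealing)
anova(runs.mdl2, runs.mdl)   # a large p-value here suggests the simpler model is adequate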
Click here to download the dataset.