Disclaimer: This example is taken from a reference titled ‘R In A Nutshell’ by Joseph A. Adler
This is one of my favourite examples from the book, mostly because it features some of my favourite bands.
Association rules are a popular technique for data mining. The algorithm finds sets of associations, that is, items that frequently occur together. For example, when analyzing supermarket data and consumer buying habits, you might find that consumers often purchase eggs and milk together. Alternatively, you might observe something like this.
If a customer buys bread, he’s 70% likely of buying milk.
– R In A Nutshell, Joseph A. Adler
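To make the numbers behind such a statement concrete, here is a minimal sketch in base R on five made-up baskets. The baskets and the helper name `support` are my own illustrative assumptions, not data or code from the book. It computes how often bread and milk occur together, and how often milk appears in the baskets that contain bread.

```r
# Hypothetical toy baskets, invented purely for illustration.
baskets <- list(
  c("bread", "milk", "eggs"),
  c("bread", "milk"),
  c("bread", "butter"),
  c("milk", "eggs"),
  c("bread", "milk", "butter")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_bread      <- support("bread", baskets)             # 4 of 5 baskets
supp_bread_milk <- support(c("bread", "milk"), baskets)  # 3 of 5 baskets

# Confidence of the rule {bread} => {milk}:
# the share of bread baskets that also contain milk.
conf_bread_milk <- supp_bread_milk / supp_bread

print(c(support = supp_bread_milk, confidence = conf_bread_milk))
```

In this toy data the rule {bread} => {milk} has a support of 0.6 and a confidence of 0.75, so a customer who buys bread buys milk three times out of four.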
To experiment with association rules, we are going to use the Apriori algorithm as implemented in the arules package. Obviously, the first thing to do is to install and load the package.
install.packages("arules")
library(arules)
Then load the dataset.
load("audioscrobbler.rda")
The dataset contains transaction data collected from Audioscrobbler. This was a (now defunct) service similar to Spotify or Apple Music where people could stream music. We are going to analyze the data to study the listening habits and preferences of its users.
Then call the apriori() function to start the modeling.
> audioscrobbler.apriori <- apriori(
+ data=audioscrobbler,
+ parameter=new("APparameter",support=0.0645)
+ )
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.8    0.1    1 none FALSE            TRUE       5  0.0645      1     10  rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 1290
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[429033 item(s), 20001 transaction(s)] done [3.16s].
sorting and recoding items ... [287 item(s)] done [0.07s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.10s].
writing ... [10 rule(s)] done [0.00s].
creating S4 object ... done [0.07s].
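The line "Absolute minimum support count: 1290" in the output follows from the support threshold and the number of transactions. A quick check in base R, using only numbers that appear in the output above (the variable names are my own):

```r
n_transactions <- 20001   # from "20001 transaction(s)" in the output
min_support    <- 0.0645  # the support passed to apriori()

# Threshold expressed as a number of transactions (about 1290.06).
raw_count <- min_support * n_transactions

# Rounding down reproduces the reported value; how arules rounds
# internally is an assumption here, not something the output states.
min_count <- floor(raw_count)
print(min_count)
```

In other words, with this setting a rule is only reported if roughly 1,290 or more of the 20,001 transactions contain all of its items.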
As you can see, the apriori() function prints some information about what it is doing while it runs. Once it finishes, take a look at the returned object.
Call the summary() function to get detailed results.
> summary(audioscrobbler.apriori)
set of 10 rules
rule length distribution (lhs + rhs):sizes
3
10
Min. 1st Qu. Median Mean 3rd Qu. Max.
3 3 3 3 3 3
summary of quality measures:
support confidence coverage lift count
Min. :0.06475 Min. :0.8008 Min. :0.07915 Min. :2.613 Min. :1295
1st Qu.:0.06536 1st Qu.:0.8027 1st Qu.:0.08083 1st Qu.:2.619 1st Qu.:1307
Median :0.06642 Median :0.8076 Median :0.08155 Median :2.651 Median :1328
Mean :0.06640 Mean :0.8128 Mean :0.08171 Mean :2.696 Mean :1328
3rd Qu.:0.06707 3rd Qu.:0.8178 3rd Qu.:0.08281 3rd Qu.:2.761 3rd Qu.:1342
Max. :0.06870 Max. :0.8399 Max. :0.08460 Max. :2.888 Max. :1374
mining info:
data ntransactions support confidence
audioscrobbler 20001 0.0645 0.8
Then call the inspect() function to view the returned rules.
> inspect(audioscrobbler.apriori)
lhs rhs support confidence coverage lift count
[1] {Jimmy Eat World,blink-182} => {Green Day} 0.06524674 0.8085502 0.08069597 2.780095 1305
[2] {The Strokes,Coldplay} => {Radiohead} 0.06619669 0.8019382 0.08254587 2.616996 1324
[3] {Interpol,Beck} => {Radiohead} 0.06474676 0.8180670 0.07914604 2.669629 1295
[4] {Interpol,Coldplay} => {Radiohead} 0.06774661 0.8008274 0.08459577 2.613371 1355
[5] {The Beatles,Interpol} => {Radiohead} 0.06719664 0.8047904 0.08349583 2.626303 1344
[6] {The Offspring,blink-182} => {Green Day} 0.06664667 0.8399496 0.07934603 2.888058 1333
[7] {Foo Fighters,blink-182} => {Green Day} 0.06669667 0.8169014 0.08164592 2.808810 1334
[8] {Pixies,Beck} => {Radiohead} 0.06569672 0.8066298 0.08144593 2.632306 1314
[9] {The Smashing Pumpkins,Beck} => {Radiohead} 0.06869657 0.8287093 0.08289586 2.704359 1374
[10] {The Smashing Pumpkins,Pink Floyd} => {Radiohead} 0.06514674 0.8018462 0.08124594 2.616695 1303
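With only ten rules the table is easy to read, but ranking the rules still helps. The sketch below retypes three columns from the output above into a plain data frame and orders the rules by lift. The data frame and its column names are my own, not an arules structure. On the real rules object, arules provides sort(audioscrobbler.apriori, by = "lift") for the same purpose.

```r
# Rules retyped from the inspect() output; this is a plain data frame,
# not the S4 rules object returned by apriori().
rules <- data.frame(
  lhs = c("Jimmy Eat World, blink-182", "The Strokes, Coldplay",
          "Interpol, Beck", "Interpol, Coldplay", "The Beatles, Interpol",
          "The Offspring, blink-182", "Foo Fighters, blink-182",
          "Pixies, Beck", "The Smashing Pumpkins, Beck",
          "The Smashing Pumpkins, Pink Floyd"),
  rhs = c("Green Day", "Radiohead", "Radiohead", "Radiohead", "Radiohead",
          "Green Day", "Green Day", "Radiohead", "Radiohead", "Radiohead"),
  lift = c(2.780095, 2.616996, 2.669629, 2.613371, 2.626303,
           2.888058, 2.808810, 2.632306, 2.704359, 2.616695),
  stringsAsFactors = FALSE
)

# Strongest associations first.
ranked <- rules[order(rules$lift, decreasing = TRUE), ]
print(head(ranked, 3))

# How many rules point to each artist?
rhs_counts <- table(rules$rhs)
print(rhs_counts)
```

The three rules with the highest lift all conclude with Green Day, while seven of the ten rules conclude with Radiohead.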
Now, let’s analyze the results.
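The left-hand side (lhs) of a rule forms its predicate, while the right-hand side (rhs) forms its conclusion.
For example, consider rule 1. Its support of 0.0652 means that about 6.5% of all users (1,305 of 20,001) listened to Jimmy Eat World, blink-182 and Green Day. Its confidence of 0.8086 says “If a user listened to Jimmy Eat World and blink-182, then about 80.9% of the time he or she also listened to Green Day”. The lift of 2.78 means that these users were about 2.8 times as likely to listen to Green Day as users in general.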
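The columns reported for rule 1 are tied together by simple arithmetic. This base R sketch recomputes them from the numbers printed by inspect() (the variable names are my own): support is the share of all transactions that match the whole rule, confidence is that share divided by the coverage of the left-hand side, and dividing confidence by lift gives the overall share of Green Day listeners.

```r
n        <- 20001       # number of transactions
count    <- 1305        # transactions containing all three artists
coverage <- 0.08069597  # support of the left-hand side alone
lift     <- 2.780095    # as reported for rule 1

support     <- count / n           # share of all users matching the whole rule
confidence  <- support / coverage  # share of lhs listeners who also match the rhs
rhs_support <- confidence / lift   # implied overall share of Green Day listeners

print(round(c(support = support,
              confidence = confidence,
              rhs_support = rhs_support), 4))
```

The recomputed support and confidence agree with the 0.06524674 and 0.8085502 shown in the table, and the implied share of Green Day listeners comes out at roughly 29% of users.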
The arules package also includes an implementation of the Eclat algorithm, which finds frequent item sets.
> # eclat()
> audioscrobbler.eclat <- eclat(
+ data=audioscrobbler,
+ parameter=new("ECparameter",support=0.129,minlen=2)
+ )
Eclat
parameter specification:
 tidLists support minlen maxlen            target  ext
    FALSE   0.129      2     10 frequent itemsets TRUE
algorithmic control:
sparse sort verbose
7 -2 TRUE
Absolute minimum support count: 2580
create itemset ...
set transactions ...[429033 item(s), 20001 transaction(s)] done [2.84s].
sorting and recoding items ... [74 item(s)] done [0.08s].
creating bit matrix ... [74 row(s), 20001 column(s)] done [0.01s].
writing ... [10 set(s)] done [0.00s].
Creating S4 object ... done [0.27s].
Then call the summary() and inspect() functions to view the returned results.
> summary(audioscrobbler.eclat)
set of 10 itemsets
most frequent items:
Green Day Radiohead Red Hot Chili Peppers Nirvana The Beatles
5 5 3 3 2
(Other)
2
element (itemset/transaction) length distribution:sizes
2
10
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 2 2 2 2 2
summary of quality measures:
support transIdenticalToItemsets count
Min. :0.1291 Min. :2582 Min. :2582
1st Qu.:0.1303 1st Qu.:2606 1st Qu.:2606
Median :0.1360 Median :2720 Median :2720
Mean :0.1382 Mean :2764 Mean :2764
3rd Qu.:0.1394 3rd Qu.:2789 3rd Qu.:2789
Max. :0.1567 Max. :3134 Max. :3134
includes transaction ID lists: FALSE
mining info:
data ntransactions support
audioscrobbler 20001 0.129
> inspect(audioscrobbler.eclat)
items support transIdenticalToItemsets count
[1] {Red Hot Chili Peppers,Radiohead} 0.1290935 2582 2582
[2] {Red Hot Chili Peppers,Green Day} 0.1397430 2795 2795
[3] {Red Hot Chili Peppers,Nirvana} 0.1336433 2673 2673
[4] {Nirvana,Radiohead} 0.1384931 2770 2770
[5] {Green Day,Nirvana} 0.1382931 2766 2766
[6] {Coldplay,Radiohead} 0.1538423 3077 3077
[7] {Coldplay,Green Day} 0.1292435 2585 2585
[8] {Green Day,Radiohead} 0.1335433 2671 2671
[9] {The Beatles,Green Day} 0.1290935 2582 2582
[10] {The Beatles,Radiohead} 0.1566922 3134 3134
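For intuition about what eclat() counts, here is a small base R sketch of the vertical idea usually associated with Eclat: store, for each item, the IDs of the transactions that contain it, then get the support of a pair by intersecting two ID lists. The listening histories and the threshold are invented, and this illustrates the idea only; it is not the arules implementation.

```r
# Hypothetical listening histories (one character vector per user).
users <- list(
  c("Radiohead", "Coldplay", "The Beatles"),
  c("Radiohead", "Coldplay"),
  c("Green Day", "Nirvana"),
  c("Radiohead", "The Beatles"),
  c("Green Day", "Coldplay", "Radiohead")
)
min_support <- 0.4  # a pair must appear for at least 40% of users

# Vertical layout: for each artist, the IDs of the users who listened to them.
artists  <- sort(unique(unlist(users)))
tid_list <- lapply(artists, function(a) {
  which(vapply(users, function(u) a %in% u, logical(1)))
})
names(tid_list) <- artists

# Support of a pair = size of the intersection of the two ID lists,
# divided by the number of users.
pairs   <- combn(artists, 2, simplify = FALSE)
support <- vapply(pairs, function(p) {
  length(intersect(tid_list[[p[1]]], tid_list[[p[2]]])) / length(users)
}, numeric(1))
names(support) <- vapply(pairs, paste, character(1), collapse = ", ")

# Keep only the pairs that reach the support threshold.
frequent_pairs <- support[support >= min_support]
print(frequent_pairs)
```

On this toy data only two pairs survive: Coldplay with Radiohead (3 of 5 users) and Radiohead with The Beatles (2 of 5 users). The real run above does the same kind of counting over 20,001 transactions with a threshold of 0.129.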