Association Rules in R

Disclaimer: This example is taken from a reference titled ‘R In A Nutshell’ by Joseph A. Adler

This is one of my favourite examples from the book, mostly because it features my favourite bands.

Association rules are a popular technique for data mining. The algorithm finds sets of items that are frequently associated with each other. For example, when analyzing supermarket data and consumer buying habits, you might find that consumers often purchase eggs and milk together. Alternatively, you might observe something like this.

If a customer buys bread, he’s 70% likely to buy milk.

R In A Nutshell, Joseph A. Adler
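
To make the quoted figure concrete, here is a minimal sketch in base R of how such a percentage (the confidence of a rule) could be computed by hand. The five baskets and the helper name contains() are made up for illustration and are not from the book.

```r
# Hypothetical toy baskets (made up for illustration, not from the book).
baskets <- list(
  c("bread", "milk"),
  c("bread", "milk", "eggs"),
  c("bread", "eggs"),
  c("milk", "eggs"),
  c("bread", "milk")
)

# TRUE for each basket that contains every item in `items`.
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

# Support: share of all baskets that contain the item set.
support_bread      <- mean(contains("bread"))
support_bread_milk <- mean(contains(c("bread", "milk")))

# Confidence of {bread} => {milk}: among baskets with bread,
# the share that also contain milk.
confidence_bread_milk <- support_bread_milk / support_bread

print(c(support = support_bread_milk, confidence = confidence_bread_milk))
```

In this toy data, bread appears in 4 of the 5 baskets and bread together with milk in 3, so the rule holds for 3 of the 4 bread baskets.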

To experiment with association rules, we are going to use the Apriori algorithm as implemented in the arules package. The first thing to do is to install and load the package.

install.packages("arules")
library(arules)

Then load the dataset.

load("audioscrobbler.rda")

The dataset contains transaction data collected from Audioscrobbler, a (now defunct) music-tracking service, later merged into Last.fm, that recorded the music its users played. We are going to analyze the data and study the listening habits and preferences of users.
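
For readers who do not have the .rda file at hand, the sketch below shows what transaction data of this kind looks like in arules. The listening histories are invented for illustration and are not taken from the Audioscrobbler data; as(), size() and itemFrequency() are arules functions.

```r
library(arules)

# Hypothetical listening histories, one character vector per user
# (invented for illustration; not taken from the Audioscrobbler data).
plays <- list(
  c("Radiohead", "Coldplay", "Beck"),
  c("Green Day", "blink-182"),
  c("Radiohead", "Coldplay"),
  c("Green Day", "blink-182", "Radiohead")
)

# Coerce the list into the sparse "transactions" class used by arules.
toy <- as(plays, "transactions")

print(size(toy))           # number of artists per user
print(itemFrequency(toy))  # share of users who played each artist
```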

Then call the apriori() function to start the modeling.

> audioscrobbler.apriori <- apriori(
+     data=audioscrobbler, 
+     parameter=new("APparameter",support=0.0645)
+ )
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target  ext
        0.8    0.1    1 none FALSE            TRUE       5  0.0645      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 1290 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[429033 item(s), 20001 transaction(s)] done [3.16s].
sorting and recoding items ... [287 item(s)] done [0.07s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.10s].
writing ... [10 rule(s)] done [0.00s].
creating S4 object  ... done [0.07s].

As you can see, the apriori() function prints information on what it is doing while it runs. After it finishes, check out the returned object.
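
The call above builds the parameter object explicitly with new("APparameter", ...). apriori() also accepts a plain named list for its parameter and control arguments. Here is a minimal sketch of that form on made-up baskets, with the progress output switched off; the data and thresholds are illustrative assumptions, not values from the book.

```r
library(arules)

# Hypothetical toy baskets (made up for illustration).
baskets <- list(
  c("bread", "milk"),
  c("bread", "milk", "eggs"),
  c("bread", "eggs"),
  c("milk", "eggs"),
  c("bread", "milk")
)
trans <- as(baskets, "transactions")

# Same kind of call as above, but with named lists instead of new("APparameter", ...).
# minlen = 2 skips rules with an empty left-hand side.
rules <- apriori(
  trans,
  parameter = list(support = 0.4, confidence = 0.7, minlen = 2),
  control   = list(verbose = FALSE)
)

inspect(rules)
```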

Call the summary() function on audioscrobbler.apriori to get detailed results.

> summary(audioscrobbler.apriori)
set of 10 rules

rule length distribution (lhs + rhs):sizes
 3 
10 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      3       3       3       3       3       3 

summary of quality measures:
    support          confidence        coverage            lift           count     
 Min.   :0.06475   Min.   :0.8008   Min.   :0.07915   Min.   :2.613   Min.   :1295  
 1st Qu.:0.06536   1st Qu.:0.8027   1st Qu.:0.08083   1st Qu.:2.619   1st Qu.:1307  
 Median :0.06642   Median :0.8076   Median :0.08155   Median :2.651   Median :1328  
 Mean   :0.06640   Mean   :0.8128   Mean   :0.08171   Mean   :2.696   Mean   :1328  
 3rd Qu.:0.06707   3rd Qu.:0.8178   3rd Qu.:0.08281   3rd Qu.:2.761   3rd Qu.:1342  
 Max.   :0.06870   Max.   :0.8399   Max.   :0.08460   Max.   :2.888   Max.   :1374  

mining info:
           data ntransactions support confidence
 audioscrobbler         20001  0.0645        0.8

Then call the inspect() function to view the returned rules.

> inspect(audioscrobbler.apriori)
     lhs                                   rhs         support    confidence coverage   lift     count
[1]  {Jimmy Eat World,blink-182}        => {Green Day} 0.06524674 0.8085502  0.08069597 2.780095 1305 
[2]  {The Strokes,Coldplay}             => {Radiohead} 0.06619669 0.8019382  0.08254587 2.616996 1324 
[3]  {Interpol,Beck}                    => {Radiohead} 0.06474676 0.8180670  0.07914604 2.669629 1295 
[4]  {Interpol,Coldplay}                => {Radiohead} 0.06774661 0.8008274  0.08459577 2.613371 1355 
[5]  {The Beatles,Interpol}             => {Radiohead} 0.06719664 0.8047904  0.08349583 2.626303 1344 
[6]  {The Offspring,blink-182}          => {Green Day} 0.06664667 0.8399496  0.07934603 2.888058 1333 
[7]  {Foo Fighters,blink-182}           => {Green Day} 0.06669667 0.8169014  0.08164592 2.808810 1334 
[8]  {Pixies,Beck}                      => {Radiohead} 0.06569672 0.8066298  0.08144593 2.632306 1314 
[9]  {The Smashing Pumpkins,Beck}       => {Radiohead} 0.06869657 0.8287093  0.08289586 2.704359 1374 
[10] {The Smashing Pumpkins,Pink Floyd} => {Radiohead} 0.06514674 0.8018462  0.08124594 2.616695 1303 

Now, let’s analyze the results.

The left-hand side (lhs) of a rule forms its predicate, whereas the right-hand side (rhs) forms its conclusion.

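For example, consider rule 1. This rule says “If a user listened to Jimmy Eat World and blink-182, then about 80.9% of the time (the confidence), he or she also listened to Green Day”. The support of 6.524674% measures something different: it is the share of all 20,001 transactions that contain all three artists.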
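
The columns reported for rule 1 are tied together by simple arithmetic, which can be checked in base R. The numbers below are copied from the inspect() output above; the variable names are only for illustration.

```r
# Values copied from the inspect() output for rule [1].
n_transactions <- 20001
count    <- 1305         # transactions containing all three artists
coverage <- 0.08069597   # support of the lhs {Jimmy Eat World, blink-182}
lift     <- 2.780095

# Support: share of all transactions that contain all three artists.
support <- count / n_transactions

# Confidence: of the transactions matching the lhs, the share that
# also contain the rhs.
confidence <- support / coverage

# Lift = confidence / support(rhs), so the overall share of transactions
# containing Green Day can be backed out from the reported lift.
support_green_day <- confidence / lift

print(round(c(support = support, confidence = confidence,
              support_green_day = support_green_day), 4))
```

So roughly 29% of all transactions contain Green Day, but about 81% of those containing both Jimmy Eat World and blink-182 do, which is what a lift of about 2.78 expresses.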

The arules package also includes an implementation of the Eclat algorithm, which finds frequent item sets.

> # eclat()
> audioscrobbler.eclat <- eclat(
+     data=audioscrobbler, 
+     parameter=new("ECparameter",support=0.129,minlen=2)
+     )
Eclat

parameter specification:
 tidLists support minlen maxlen            target  ext
    FALSE   0.129      2     10 frequent itemsets TRUE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 2580 

create itemset ... 
set transactions ...[429033 item(s), 20001 transaction(s)] done [2.84s].
sorting and recoding items ... [74 item(s)] done [0.08s].
creating bit matrix ... [74 row(s), 20001 column(s)] done [0.01s].
writing  ... [10 set(s)] done [0.00s].
Creating S4 object  ... done [0.27s].

Then call the summary() and inspect() functions to view the returned results.

> summary(audioscrobbler.eclat)
set of 10 itemsets

most frequent items:
            Green Day             Radiohead Red Hot Chili Peppers               Nirvana           The Beatles 
                    5                     5                     3                     3                     2 
              (Other) 
                    2 

element (itemset/transaction) length distribution:sizes
 2 
10 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2       2       2       2       2       2 

summary of quality measures:
    support       transIdenticalToItemsets     count     
 Min.   :0.1291   Min.   :2582             Min.   :2582  
 1st Qu.:0.1303   1st Qu.:2606             1st Qu.:2606  
 Median :0.1360   Median :2720             Median :2720  
 Mean   :0.1382   Mean   :2764             Mean   :2764  
 3rd Qu.:0.1394   3rd Qu.:2789             3rd Qu.:2789  
 Max.   :0.1567   Max.   :3134             Max.   :3134  

includes transaction ID lists: FALSE 

mining info:
           data ntransactions support
 audioscrobbler         20001   0.129
> inspect(audioscrobbler.eclat)
     items                             support   transIdenticalToItemsets count
[1]  {Red Hot Chili Peppers,Radiohead} 0.1290935 2582                     2582 
[2]  {Red Hot Chili Peppers,Green Day} 0.1397430 2795                     2795 
[3]  {Red Hot Chili Peppers,Nirvana}   0.1336433 2673                     2673 
[4]  {Nirvana,Radiohead}               0.1384931 2770                     2770 
[5]  {Green Day,Nirvana}               0.1382931 2766                     2766 
[6]  {Coldplay,Radiohead}              0.1538423 3077                     3077 
[7]  {Coldplay,Green Day}              0.1292435 2585                     2585 
[8]  {Green Day,Radiohead}             0.1335433 2671                     2671 
[9]  {The Beatles,Green Day}           0.1290935 2582                     2582 
[10] {The Beatles,Radiohead}           0.1566922 3134                     3134
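
Frequent item sets carry no direction, but rules can be derived from them afterwards. As a hedged sketch, arules provides ruleInduction() for this step; the toy baskets and thresholds below are assumptions for illustration and are not part of the book's example.

```r
library(arules)

# Hypothetical toy baskets (made up for illustration).
baskets <- list(
  c("bread", "milk"),
  c("bread", "milk", "eggs"),
  c("bread", "eggs"),
  c("milk", "eggs"),
  c("bread", "milk")
)
trans <- as(baskets, "transactions")

# Frequent item sets of at least two items, found with Eclat.
itemsets <- eclat(
  trans,
  parameter = list(support = 0.5, minlen = 2),
  control   = list(verbose = FALSE)
)

# Derive rules from those item sets, keeping the ones that reach
# the chosen confidence threshold.
induced <- ruleInduction(itemsets, trans, confidence = 0.7)

inspect(itemsets)
inspect(induced)
```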
