Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

DZone's Guide to

My Implementation of the Apriori Algorithm

· Java Zone ·
Free Resource

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Java-based (JDBC) data connectivity to SaaS, NoSQL, and Big Data. Download Now.

This is a self imposed machine problem I wrote over a frantic afternoon for my lesson on Frequent Itemsets and the Apriori Algorithm.

I wanted to write a program that would find the top five frequent item sets among a set of baskets.

Consider the item set below:

D = [ ('milk', 'cheese', 'tuna'),
('cheese', 'mushroom'),
('cheese', 'eggs'),
('milk', 'cheese', 'mushroom'),
('milk', 'eggs'),
('cheese', 'eggs'),
('milk', 'cheese', 'mushroom', 'tuna'),
('milk', 'cheese', 'eggs') ]


The goal I want for my program is to be able to induce that

$milk \to cheese$ (the presence of milk somehow implies the presence of cheese as well)

or

$milk,cheese \to mushroom$ (milk, cheese and mushroom) somehow go together.

Just right now, I think I could also find a correlation between item sets and dates.

Anyway, as I have learned the key is to use map reduce and I was able to do this quite easily with a NoSQL database like Mongo DB.

A link to the code I wrote is available on github.

So far, I am able to generate the candidate item set for 2-combinations of the universal sets. This is another way of saying that the program I wrote is to count the number of times a two combination appears in all the baskets. For example, it is able to count that (‘milk’, ‘cheese’) occurs four times. Below is a screenshot of the script doing just that:

The next step after this is to generate the frequent item sets from the the candidate item set. This just means, I have to filter out non frequent 2-combinations. This can easily be done in two steps and is based on a support count that is defined before hand.

Let’s say a support count of 2 is defined, then the first thing to do is to check if the count of the 1 item sets that make up the two item sets have support count $\geq 2$. Remember that the key to the apriori algorithm is that the subsets of a frequent item set must also be frequent.

This can easily be derived by running the script I wrote:

Now that I am able to generate the candidate item sets for any n-combination of the universal set, some next steps for my program would be to

1. Generate the frequent item set given a candidate item set.
2. Report the top 5 item sets with the highest interest and highest confidence.

Performance considerations also come to mind as my implementation of the algorithm has not been tested for a large data set.

Connect any Java based application to your SaaS data.  Over 100+ Java-based data source connectors.

Topics:

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}