Getting Started With Data Mining
here are just notes from my data mining class, which i will begin to consolidate here in my blog as a way to assimilate the lessons.
1. the market basket model is probably the easiest introduction for anyone interested in data mining. the concept is simple. there are baskets and there are items in those baskets.
the market basket model: there are items and there are baskets that hold those items. each basket is itself a set of items, that is, an itemset.
2. closely related to the market basket model is the concept of
frequent itemsets. intuitively, a set of items is frequent if it appears
in many baskets.
3. the following terms are used a lot when talking about frequent itemsets:
- support count: the number of baskets in which an itemset appears as a subset. for example, in the basket set
  [ ('cheese', 'milk', 'eggs'), ('milk'), ('milk', 'eggs', 'bread') ]
  the support count of the itemset ('milk', 'eggs') is 2, since it is a subset of two of the baskets.
- support threshold: the numerical limit that draws the line between a frequent itemset and a non-frequent one. for example, in the basket set above, if one sets the support threshold at 2, then the itemset ('milk', 'eggs') is frequent, because its support count is at least 2.
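to make the two terms concrete, here is a minimal python sketch of the example above. it assumes each basket is stored as a python set, and the names `support_count` and `is_frequent` are just ones i picked for illustration, not from any library.

```python
# the basket set from the example above; each basket is a set of items
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= basket` is python's subset test for sets
    return sum(1 for basket in baskets if itemset <= basket)


def is_frequent(itemset, baskets, threshold):
    """an itemset is frequent if its support count meets the threshold."""
    return support_count(itemset, baskets) >= threshold


print(support_count({"milk", "eggs"}, baskets))   # 2
print(is_frequent({"milk", "eggs"}, baskets, 2))  # True
print(is_frequent({"cheese"}, baskets, 2))        # False
```

scanning every basket for every itemset is fine for three baskets. on real data one would want something smarter, but that is outside these notes.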
4. frequent itemsets are presented as an if-then rule like so:
$$i \rightarrow j$$
where i is a set of items and j is an item. this representation is
called an association rule. in words, it can be said that if i appears
in a basket then j is "likely" to appear as well.
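5. in data mining parlance, the concept of 'likely' is more formally known as the confidence of the rule $i \rightarrow j$. mathematically,
$$\mathrm{confidence}(i \rightarrow j) = \frac{\mathrm{support}(i \cup \{j\})}{\mathrm{support}(i)}$$
where support is the support count, that is, the number of baskets that contain the given itemset.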
the insights behind this formula are that
- baskets with $i \cup \{j\}$ in them cannot outnumber baskets with i in them. think about it: if a basket has i, then j may or may not be in that same basket.
- having said that, the more baskets with $i \cup \{j\}$ in them, the better. this makes the confidence in the rule stronger.
- if the support count of $i \cup \{j\}$ is the same as the support count of i, then the confidence is 1, or 100%. in other words, $i \rightarrow j$ holds all the time!
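the confidence of a rule can be sketched in python as well. this reuses the three example baskets from the support count discussion. the helper names `support_count` and `confidence` are my own, and i am assuming baskets are stored as python sets.

```python
# the same three example baskets, each stored as a set of items
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(i, j, baskets):
    """confidence of the rule i -> j: support(i with j added) / support(i)."""
    i = set(i)
    support_i = support_count(i, baskets)
    if support_i == 0:
        # the rule is undefined when i never appears in any basket
        raise ValueError("i does not appear in any basket")
    return support_count(i | {j}, baskets) / support_i


# milk is in 3 baskets, milk and eggs together in 2, so this is 2/3
print(confidence({"milk"}, "eggs", baskets))
# eggs is in 2 baskets and both also hold milk, so this is 1.0
print(confidence({"eggs"}, "milk", baskets))
```

note that the two directions give different answers: every basket with eggs has milk, but not every basket with milk has eggs.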
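6. in an association rule, interest is an indicator of how the itemset on the left affects the item on the right in $i \rightarrow j$. the formula is:
$$\mathrm{interest}(i \rightarrow j) = \mathrm{confidence}(i \rightarrow j) - \frac{\text{number of baskets with } j}{\text{total number of baskets}}$$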
the insight behind this formula is that
- if the confidence outweighs the fraction of baskets with j, then it can be said that there is indeed a correlation between i and j, and the presence of i may somehow affect the presence of j.
- on the other hand, if there are significantly more baskets with j but not i, then the association rule $i \rightarrow j$ isn't really strong. it is not the presence of i that implies the presence of j but something else, and the instances when i and j are together in a basket can be said to be isolated.
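here is a small python sketch of interest, taken as the confidence of the rule minus the fraction of baskets that contain j. it uses the same three example baskets as before. the function names are hypothetical and baskets are assumed to be python sets.

```python
# the same three example baskets, each stored as a set of items
baskets = [
    {"cheese", "milk", "eggs"},
    {"milk"},
    {"milk", "eggs", "bread"},
]


def support_count(itemset, baskets):
    """number of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(i, j, baskets):
    """confidence of the rule i -> j; assumes i appears in at least one basket."""
    i = set(i)
    return support_count(i | {j}, baskets) / support_count(i, baskets)


def interest(i, j, baskets):
    """confidence of i -> j minus the fraction of all baskets that contain j."""
    fraction_with_j = support_count({j}, baskets) / len(baskets)
    return confidence(i, j, baskets) - fraction_with_j


# milk is in every basket anyway, so seeing eggs adds nothing: 1.0 - 3/3
print(interest({"eggs"}, "milk", baskets))
# cheese only shows up alongside eggs, while eggs is in 2 of 3 baskets: 1.0 - 2/3
print(interest({"cheese"}, "eggs", baskets))
```

in this toy data the rule eggs -> milk has a confidence of 100% but an interest of zero, because milk is in every basket regardless of what else is there. with only three baskets these numbers are illustrative at best.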
Published at DZone with permission of Jose Asuncion, DZone MVB. See the original article here.