
Getting Started With Data Mining

Jose Asuncion
Mar. 04, 12 · Java Zone · Interview

These are notes from my data mining class, which I am starting to consolidate on my blog as a way to assimilate the lessons.

1. The market basket model is probably the easiest introduction for anyone interested in data mining. The concept is simple: there are baskets, and there are items in those baskets.

The market basket model: there are items, and there are baskets, also called itemsets, that hold those items.


2. Closely related to the market basket model is the concept of frequent itemsets. Intuitively, a set of items is frequent if it appears in many baskets.

3. The following terms come up a lot when talking about frequent itemsets:

  • support count – the number of baskets, in a given set of baskets, that contain an itemset. For example, in the basket set

    [ ('cheese', 'milk', 'eggs'), ('milk',), ('milk', 'eggs', 'bread') ]


    the support count of the itemset (‘milk’, ‘eggs’) is 2, since it is a subset of two baskets (the first and the third).

  • support threshold – the minimum support count an itemset must reach to be considered frequent. For example, in the basket set above, if the support threshold is set at >= 2, then the itemset (‘milk’, ‘eggs’) is frequent, because its support count is 2.
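To make the two terms concrete, here is a minimal Python sketch using the basket set from the example. The function names `support_count` and `is_frequent` are made up for illustration, and the threshold is assumed to be inclusive (>=), as in the example above.

```python
# Baskets are modelled as frozensets so that subset tests are straightforward.
baskets = [
    frozenset({"cheese", "milk", "eggs"}),
    frozenset({"milk"}),
    frozenset({"milk", "eggs", "bread"}),
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for basket in baskets if items <= basket)


def is_frequent(itemset, baskets, threshold):
    """True when the support count reaches the (assumed inclusive) threshold."""
    return support_count(itemset, baskets) >= threshold


print(support_count({"milk", "eggs"}, baskets))   # 2
print(is_frequent({"milk", "eggs"}, baskets, 2))  # True
print(is_frequent({"cheese"}, baskets, 2))        # False
```

Scanning every basket for every itemset is fine for a toy example like this one, but it is only meant to mirror the definitions, not to be an efficient way of finding frequent itemsets.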

4. Frequent itemsets are often presented as if-then rules of the form $i \to j$, where $i$ is a set of items and $j$ is an item. This representation is called an association rule. In words: if $i$ appears in a basket, then $j$ is “likely” to appear as well.

5. In data mining parlance, this notion of ‘likely’ is more formally known as the confidence of the rule $i \to j$. Mathematically,

$$\text{confidence}(i \to j) = \frac{\text{support}(i \cup \{j\})}{\text{support}(i)}$$


The insights behind this formula are:

  • Baskets containing $i \cup \{j\}$ cannot outnumber baskets containing $i$. Every basket that contains $i \cup \{j\}$ also contains $i$, while a basket that contains $i$ may or may not contain $j$. The confidence is therefore never greater than 1.
  • That said, the larger the share of baskets containing $i$ that also contain $j$, the better: this makes the confidence in the rule stronger.
  • If support($i \cup \{j\}$) is the same as support($i$), then the confidence is 1, or 100%. In other words, $j$ appears in every basket that contains $i$.
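The confidence formula can be sketched in Python as follows. This is only an illustration on the small basket set used earlier: the names `support_count` and `confidence` are hypothetical, and `Fraction` is used so that the ratios stay exact.

```python
from fractions import Fraction

baskets = [
    frozenset({"cheese", "milk", "eggs"}),
    frozenset({"milk"}),
    frozenset({"milk", "eggs", "bread"}),
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for basket in baskets if items <= basket)


def confidence(antecedent, consequent, baskets):
    """support(i ∪ {j}) / support(i) for the rule i -> j."""
    i = frozenset(antecedent)
    support_i = support_count(i, baskets)
    if support_i == 0:
        # The rule says nothing if i never appears in any basket.
        raise ValueError("antecedent never appears; confidence is undefined")
    return Fraction(support_count(i | {consequent}, baskets), support_i)


print(confidence({"milk"}, "eggs", baskets))  # 2/3
print(confidence({"eggs"}, "milk", baskets))  # 1
```

The second rule has confidence 1 because every basket that contains eggs also contains milk, which is the "all the time" case from the last bullet.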

6. In an association rule $i \to j$, interest is an indicator of how much the itemset on the left affects the item on the right. The formula is:

$$\text{interest}(i \to j) = \text{confidence}(i \to j) - \frac{\text{number of baskets containing } j}{\text{number of baskets}}$$


The insight behind this formula is:

  • If the confidence clearly outweighs the fraction of baskets containing $j$, the interest is positive, and it can be said that there is a correlation between $i$ and $j$: baskets containing $i$ are more likely than baskets in general to contain $j$.

  • On the other hand, if $j$ shows up in about as large a fraction of all baskets as it does in baskets containing $i$, the interest is close to zero and the association rule $i \to j$ isn’t really strong. In that case it is probably not the presence of $i$ that explains the presence of $j$, but something else, such as $j$ simply being common. The baskets in which $i$ and $j$ appear together can then be regarded as coincidental.
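Here is a small Python sketch of the interest formula on the same basket set as before. The helper names (`support_count`, `confidence`, `interest`) are hypothetical and only follow the definitions in these notes.

```python
from fractions import Fraction

baskets = [
    frozenset({"cheese", "milk", "eggs"}),
    frozenset({"milk"}),
    frozenset({"milk", "eggs", "bread"}),
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for basket in baskets if items <= basket)


def confidence(antecedent, consequent, baskets):
    """support(i ∪ {j}) / support(i) for the rule i -> j."""
    i = frozenset(antecedent)
    return Fraction(support_count(i | {consequent}, baskets),
                    support_count(i, baskets))


def interest(antecedent, consequent, baskets):
    """Confidence of i -> j minus the fraction of all baskets containing j."""
    fraction_with_j = Fraction(support_count({consequent}, baskets), len(baskets))
    return confidence(antecedent, consequent, baskets) - fraction_with_j


print(interest({"eggs"}, "milk", baskets))    # 0
print(interest({"cheese"}, "eggs", baskets))  # 1/3
```

The rule eggs -> milk has confidence 1 but interest 0, because milk is in every basket anyway. The rule cheese -> eggs has positive interest, although with only three baskets that says very little.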





Published at DZone with permission of Jose Asuncion, DZone MVB. See the original article here.
