# The Intersection Between the Top Data Mining Algorithms and AI

### A look into five of the top data mining algorithms and their use cases for Big Data and machine learning.

In 2007, a team of researchers affiliated with the IEEE International Conference on Data Mining (ICDM) published a survey paper on the top 10 data mining algorithms. Here is the list:

1. C4.5
2. k-Means
3. Support Vector Machines (SVM)
4. Apriori
5. Expectation Maximization (EM)
6. PageRank
7. AdaBoost
8. k-Nearest Neighbors (kNN)
9. Naive Bayes
10. Classification and Regression Tree (CART)

Some of these algorithms are playing a very important role in the future of artificial intelligence. According to a GetResponse blog post, AI is already playing an influential role in marketing:

“The technology is of course already there: artificial intelligence is no longer a sci-fi movie thing, but allows you to even automate creativity. Custom audiences and re-targeting options are now a must in advertising.”

Here is an overview of the relationship between several of the most prominent data mining algorithms and the role they are playing in marketing AI.

## C4.5

C4.5 is an algorithm for building decision trees that classify data. It was developed by Ross Quinlan, a machine learning researcher, in the early 1990s as a successor to his earlier ID3 algorithm, and it is still widely used today.

There are several reasons C4.5 is so useful. One of the biggest advantages is that the resulting tree is easy to interpret and can be used to assess the probability of various outcomes. It has an open-source Java implementation called J48 in the Weka data mining toolkit, which can be used to build the trees needed to make educated probability estimates.
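To make the tree-building idea concrete, here is a minimal sketch of the gain ratio, the criterion C4.5 uses to choose which attribute to split on. The data set, the attribute names (`channel`, `returning`, `converted`), and the function names are hypothetical and chosen only for illustration; this is not how J48 is implemented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain from splitting on `attribute`, divided by split information."""
    total = len(rows)
    base = entropy([row[target] for row in rows])

    # Group the target labels by the value of the candidate attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])

    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = base - remainder

    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical marketing records (made up for this example).
rows = [
    {"channel": "email", "returning": "yes", "converted": "yes"},
    {"channel": "email", "returning": "no", "converted": "yes"},
    {"channel": "ads", "returning": "yes", "converted": "no"},
    {"channel": "ads", "returning": "no", "converted": "no"},
    {"channel": "email", "returning": "yes", "converted": "yes"},
    {"channel": "ads", "returning": "no", "converted": "no"},
]

# The attribute with the highest gain ratio becomes the root split.
best = max(["channel", "returning"], key=lambda a: gain_ratio(rows, a, "converted"))
print(best)
```

A full C4.5 implementation would apply this choice recursively to each branch, handle continuous attributes and missing values, and prune the finished tree.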

## Apriori

Many AI applications need to extrapolate from data to make educated projections and identify trends. Since most trends aren’t linear, it can be very difficult for even seasoned mathematicians to form these conclusions on their own.

Apriori is a data mining algorithm that was built for finding such patterns. It counts how often items, and combinations of items, appear together in the records of a database. Combinations that appear frequently enough (at or above a minimum support threshold) are kept and used to derive association rules that reveal trends.

Apriori is used in applications where large data sets are available. It keeps the search manageable by discarding any combination that contains an infrequent subset, although it still needs repeated passes over the data. It is widely used for applications where new data is readily available, such as Forex and marketing automation.
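The counting and pruning described above can be sketched in a few lines. This is a simplified illustration, assuming transactions small enough to fit in memory; the function name `apriori`, the `min_support` parameter (an absolute count), and the shopping baskets are all made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset seen in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets (made up for this example).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

result = apriori(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what gives the algorithm its name: any itemset containing an infrequent subset is known a priori to be infrequent, so it is never counted.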

## PageRank

PageRank is possibly the most specialized algorithm in the survey, so it’s surprising that it appears sixth on the list. It is best known as the link-analysis algorithm Google introduced for ranking websites in its search engine index.

While Google's production ranking system is proprietary, the core PageRank algorithm has been published, and it is still an interesting case study on the way data mining algorithms shape the future of AI.

The PageRank algorithm was originally used to rank websites based on their inbound links: a page is considered important if other important pages link to it. However, many webmasters began using unnatural strategies to earn links to their websites. PageRank now works alongside other algorithms that identify and penalize sites that use abusive link-building strategies. Although Google's team sometimes conducts manual reviews of websites, the PageRank algorithm works so well with other Google solutions that human reviews are rarely necessary.
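The published form of the algorithm can be sketched with simple power iteration. This is a textbook-style illustration, not Google's production system; the tiny link graph and the `damping` and `iterations` parameters are assumptions made for the example, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores for a dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site (made up for this example).
links = {
    "home": ["blog", "shop"],
    "blog": ["home"],
    "shop": ["home"],
    "about": ["home"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

In this graph every other page links to `home`, so it collects the highest score, while `about`, which has no inbound links, is left with only the baseline share.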

## Naive Bayes

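The Naive Bayes algorithm is built on Bayes' rule. It is called "naive" because it assumes that the features are independent of one another given the class. It can be used for two purposes:

Identifying a discrete class when the features of an item are known.

Estimating how likely particular features are, given a class.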

In layman’s terms, this algorithm lets you update your assumptions as new evidence comes in.

How is the Naive Bayes algorithm intertwined with artificial intelligence? It is used to revise assumptions as new data arrives. This makes it valuable for digital marketers conducting split tests. They rely on split testing tools that allow them to account for new data as it comes in. Since patterns become more evident as more data is collected, this algorithm has played an important role in split testing and marketing automation.
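As a rough illustration of the first use, identifying a class from known features, here is a minimal categorical Naive Bayes classifier with add-one (Laplace) smoothing. The training records, the feature names (`device`, `source`), and the labels are hypothetical; a real split testing tool would work with far more data.

```python
from collections import Counter, defaultdict
from math import log


def train(samples):
    """Count classes and per-class feature values from (features, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values = defaultdict(set)              # feature name -> distinct values seen
    for features, label in samples:
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
            values[name].add(value)
    return class_counts, feature_counts, values


def predict(model, features):
    """Return the label with the highest posterior score for the given features."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        # Start from the log prior, then add one log likelihood per feature.
        score = log(count / total)
        for name, value in features.items():
            numerator = feature_counts[(label, name)][value] + 1  # add-one smoothing
            denominator = count + len(values[name])
            score += log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical campaign records (made up for this example).
samples = [
    ({"device": "mobile", "source": "email"}, "click"),
    ({"device": "mobile", "source": "email"}, "click"),
    ({"device": "desktop", "source": "email"}, "click"),
    ({"device": "desktop", "source": "ad"}, "ignore"),
    ({"device": "desktop", "source": "ad"}, "ignore"),
    ({"device": "mobile", "source": "ad"}, "ignore"),
]

model = train(samples)
print(predict(model, {"device": "mobile", "source": "email"}))
print(predict(model, {"device": "desktop", "source": "ad"}))
```

Retraining on a larger sample shifts the counts, which is how the classifier's assumptions get updated as new evidence arrives.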

## Expectation Maximization

Many probability models use weighted averages. They recognize that one possible outcome may be worth more than another, even if the probability of it occurring is lower. For instance, consider the two possible outcomes in the scenario below:

Outcome 1 has a 70% chance of occurring. The value of outcome 1 is 5.

Outcome 2 has a 30% chance of occurring. The value of this outcome is 12.

Using a weighted average approach, the two outcomes contribute 3.5 (0.70 × 5) and 3.6 (0.30 × 12) respectively, for a combined expected value of 7.1. The second outcome contributes more, despite having a much lower probability of occurring.
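The arithmetic for this scenario is easy to check in code. The snippet below simply multiplies each outcome's probability by its value; the variable names are illustrative.

```python
# (name, probability, value) for the two outcomes in the scenario above.
outcomes = [
    ("Outcome 1", 0.70, 5),
    ("Outcome 2", 0.30, 12),
]

# Each outcome's contribution is its probability times its value.
contributions = {name: probability * value for name, probability, value in outcomes}

# The expected value of the scenario is the sum of the contributions.
expected_value = sum(contributions.values())

for name, contribution in contributions.items():
    print(name, round(contribution, 2))
print("Expected value:", round(expected_value, 2))
```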

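The expectation maximization algorithm automates this kind of probability-weighted analysis for models with hidden (unobserved) variables. It alternates between an expectation step, which computes expected values under the current parameter estimates, and a maximization step, which updates the parameters to best fit those expectations. Users can rely on existing implementations, so they don’t have to process the data on their own.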
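One common use of expectation maximization is fitting a mixture of two Gaussian distributions to unlabeled data. The sketch below is a minimal one-dimensional version under simplifying assumptions: the starting values are crude, the number of iterations is fixed, and the data points (think of two groups of customers with different typical order sizes) are made up for the example.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, var):
    """Probability density of a normal distribution at x."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    # Crude starting point: smallest and largest values as the two means.
    means = [min(data), max(data)]
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # Expectation step: probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])

        # Maximization step: re-estimate parameters from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a zero variance
            weights[k] = nk / len(data)

    return weights, means, variances


# Hypothetical measurements drawn from two groups (made up for this example).
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]

weights, means, variances = em_two_gaussians(data)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

The expectation step is itself a weighted average, like the scenario above: each point's contribution to a component is scaled by the probability that it belongs there.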
