
Resolving Barriers to Running Sequential Pattern Mining Algorithms

Data mining has been a major topic of discussion within the technology community for nearly 30 years, since Gregory Piatetsky-Shapiro coined the related term "knowledge discovery in databases" in 1989. It has become more prevalent in recent years, as organizations store much larger datasets and use Hadoop-based tools to extract and sort them more easily. Data mining processes have evolved to the point that they can also extract previously undetectable patterns from those datasets.

One technique for doing so is known as sequential pattern mining. Data scientists are not only able to identify previously unrecognizable correlations between variables but can also look at the chronological order of events to suggest causal relationships that would otherwise be overlooked.

This is an exciting new field of focus within the big data community. However, it has introduced a new set of challenges that data scientists must overcome.

Primary Obstacles to Developing Effective Sequential Pattern Mining Processes

Data scientists who are trying to conduct sequential pattern mining often encounter the same pitfalls that analysts running regression and other statistical analyses face. The biggest mistake is confusing correlation with causation.

Data scientists must be careful not to draw inaccurate conclusions from the data they evaluate. Here are some of the biggest issues they will face.

Separating Meaningful Patterns From Irrelevant Correlations

Some patterns tell a very important story that decision makers will need to rely on. Others have little significance to the problems they are trying to represent. Decision makers who draw the wrong conclusions from them can make costly errors.

It is often impossible to separate the two by evaluating a pattern in isolation from other system constraints. The pattern must be looked at in context, and the right nuances need to be built into the model.

More importantly, the patterns must be pertinent to the system itself. Carl H. Mooney of Flinders University illustrates this in the paper Sequential Pattern Mining: Approaches and Algorithms, commenting on the limitations of early work in the field:

“This seminal work, however, has some limitations: Given that the output was the maximal frequent sequences, some of the inferences (rules) that could be made could be construed as being of no real value. For example, a retail store would probably not be interested in knowing that a customer purchased product ‘A’ and some considerable time later purchased product ‘B.’”
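One common way to screen out patterns like the one in this example is to bound the time allowed between two events. The following is a minimal sketch of that idea, not a method taken from the paper: the function name `occurs_within_gap`, the `(timestamp, item)` tuple layout, and the 30-day threshold used in the note below are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def occurs_within_gap(events, first, second, max_gap):
    """Return True if `second` occurs after `first` within `max_gap`.

    `events` is a list of (timestamp, item) tuples for one customer
    (a hypothetical layout chosen for this sketch).
    """
    ordered = sorted(events, key=lambda e: e[0])
    for i, (t_first, item) in enumerate(ordered):
        if item != first:
            continue
        for t_second, later_item in ordered[i + 1:]:
            if t_second - t_first > max_gap:
                break  # every remaining event is even further away
            if later_item == second and t_second > t_first:
                return True
    return False
```

With a 30-day threshold, for example, a purchase of "A" on January 1 followed by "B" on January 10 would count as a sequence, while "A" followed by "C" five months later would not.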

Pattern Mining Needs to Be Efficient and Scalable

Evaluating dozens of terabytes of data requires a tremendous amount of computing resources and time. It can take days, weeks or even months to properly mine patterns without the right controls in place. By the time the patterns have been extracted, it may be too late to use them.

Conclusions Cannot Exceed the Constraints Allowed by the System

Every system has its own limitations. When mining patterns, analysts must make sure their observations don't lead to conclusions that would convince decision makers to exceed those limits.

Protocols for Developing a Sequential Pattern Mining Method

There are a few steps that you can take to address the concerns that arise with sequential pattern mining. Here are some of the most important controls to build into the framework.

Make Sure Chronological Arrays Are Built Into the Data Sets

Data analysts cannot make causal inferences from data without reliable timestamps. The exact time of each event needs to be stored accurately and kept easily accessible in the data sets used for sequential pattern mining.
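A simple safeguard is to validate timestamps before any mining starts. The sketch below is only an illustration: the function name `clean_and_sort` and the record keys `customer_id`, `item`, and `timestamp` are hypothetical, and it assumes timestamps arrive as ISO 8601 strings that all follow the same time zone convention.

```python
from datetime import datetime

def clean_and_sort(records):
    """Split records into chronologically sorted valid rows and rejected rows.

    Each record is a dict with the hypothetical keys "customer_id",
    "item", and "timestamp" (an ISO 8601 string).
    """
    valid, rejected = [], []
    for rec in records:
        try:
            ts = datetime.fromisoformat(rec["timestamp"])
        except (KeyError, TypeError, ValueError):
            # Missing or unparseable timestamp: set aside rather than guess.
            rejected.append(rec)
            continue
        valid.append({**rec, "timestamp": ts})
    valid.sort(key=lambda r: r["timestamp"])
    return valid, rejected
```

Keeping the rejected rows separate, rather than silently dropping them or filling in a default time, makes it possible to see how much of the data set lacks a usable event time.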

Minimize Dataset Server Downtime

A number of problems can occur if the server storing your datasets is taken off the network. The biggest is that new data may be added at a later date with the wrong timestamps, which can seriously distort your sequential pattern analysis. Check a provider's uptime history before relying on it to reduce this risk.

Use Empirically Proven Sequential Pattern Mining Methods

Some sequential pattern mining methods have more of a proven track record than others. You should consider using Apriori-based approaches (such as GSP) and pattern-growth-based approaches (such as PrefixSpan), as there is considerable evidence supporting their effectiveness.
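To give a sense of how a pattern-growth approach works, here is a deliberately simplified sketch in the style of PrefixSpan. It is an assumption-laden illustration rather than a reference implementation: it treats each event as a single item instead of an itemset, it is not optimized for large data, and the function name `prefixspan` and its parameters are chosen for this example only.

```python
from collections import defaultdict

def prefixspan(sequences, min_support):
    """Return {pattern_tuple: support} for all frequent sequential patterns.

    `sequences` is a list of item lists. Support is the number of
    sequences that contain the pattern as a (possibly gapped) subsequence.
    """
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence_index, start_position) pairs: the
        # suffix of each sequence that remains after matching `prefix`.
        counts = defaultdict(set)
        for seq_idx, start in projected:
            for item in set(sequences[seq_idx][start:]):
                counts[item].add(seq_idx)
        for item, seq_ids in counts.items():
            if len(seq_ids) < min_support:
                continue  # infrequent extensions are never grown further
            pattern = prefix + (item,)
            results[pattern] = len(seq_ids)
            next_projected = []
            for seq_idx, start in projected:
                try:
                    pos = sequences[seq_idx].index(item, start)
                except ValueError:
                    continue  # item does not occur in this suffix
                next_projected.append((seq_idx, pos + 1))
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results
```

Calling `prefixspan([["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]], min_support=2)` reports `("a", "b")` with a support of 2 but omits `("a", "b", "c")`, which appears in only one sequence. The key design point is that each recursive call scans only the projected suffixes, so the search space shrinks as patterns grow.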

Make Sure That Horizontal Formatting Algorithms Are Properly Structured

Horizontal formatting refers to how the original data is structured. You need to identify the fields that will key and order the records for your analysis. In many applications, the data is sorted by customer ID and then by transaction time, which turns it into a customer sequence database.

The exact structure of your horizontal format varies by application. The important thing is to identify the variables you will be studying and make sure they are properly sorted in your data set, so they can be easily accessed for future reference.
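Building a customer sequence database from raw transaction rows can be sketched roughly as follows. The function name `build_sequence_database` and the `(customer_id, timestamp, item)` row layout are assumptions for illustration, and the sketch assumes timestamps are values that sort chronologically (for example, ISO 8601 strings or `datetime` objects).

```python
from collections import defaultdict

def build_sequence_database(transactions):
    """Group (customer_id, timestamp, item) rows into per-customer sequences.

    Returns {customer_id: [itemset, itemset, ...]}. Each itemset is a
    sorted list of the items sharing one timestamp, and the itemsets
    appear in chronological order.
    """
    by_customer = defaultdict(lambda: defaultdict(set))
    for customer_id, timestamp, item in transactions:
        by_customer[customer_id][timestamp].add(item)
    return {
        customer_id: [sorted(by_time[t]) for t in sorted(by_time)]
        for customer_id, by_time in sorted(
            by_customer.items(), key=lambda kv: kv[0]
        )
    }
```

Items bought at the same timestamp are kept together as one itemset, so a customer who buys eggs one day and bread and butter the next ends up with the sequence `[["eggs"], ["bread", "butter"]]`.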

Sequential Pattern Mining Is an Evolving Field With Complex Implications

Sequential pattern mining is a complex field that has introduced a number of challenges for data analysts. The good news is that the methods described above can help minimize these challenges.
