Data Mining Process: Cross-Industry Standard Process for Data Mining

In this article, we provide a high-level overview of the data mining process, discussing topics such as data cleaning, pattern evaluation, and more.

By Shailna Patidar · Aug. 13, 18 · Analysis

1. Introduction to Data Mining

Data mining is the process of discovering hidden, valuable knowledge by analyzing a large amount of data. This data is typically stored across different databases.

Data mining is valuable to many industries, such as manufacturing and marketing. Therefore, there's a need for a standard data mining process. This process must be reliable, and it should be repeatable by business people with little to no knowledge of data science.

2. Stages of Data Mining Process

The data mining process is divided into two stages: data preparation (also called data preprocessing) and data mining.

[Figure: Stages of the data mining process]

The data preparation process includes data cleaning, data integration, data selection, and data transformation. The second phase includes data mining, pattern evaluation, and knowledge representation.

a. Data Cleaning

Data has to be cleaned before it is mined because real-world data is noisy, inconsistent, and incomplete.

Data cleaning includes a number of techniques, such as filling in missing values, smoothing noisy data, and resolving inconsistencies.
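As an illustration of filling in missing values, here is a minimal sketch that replaces each missing entry with the mean of the observed entries. The function name `fill_missing_with_mean` and the sample data are hypothetical, and mean imputation is only one of several possible strategies.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries
ages = [20, None, 30, 40, None]
cleaned_ages = fill_missing_with_mean(ages)
print(cleaned_ages)
```

Mean imputation keeps the column average unchanged, but it shrinks the variance, so it may not suit every analysis.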

b. Data Integration

In this process, data is integrated from different data sources, as data is held in different formats in different locations. We can store data in a database, text files, spreadsheets, documents, data cubes, and so on. However, data integration is complex because data from different sources usually doesn't match.

We use metadata to reduce errors in the data integration process. Another issue faced is data redundancy. In this case, the same data might be available in different tables in the same database. Data integration tries to reduce redundancy as much as possible without affecting the reliability of the data.
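The idea of using metadata to line up mismatched sources and reduce redundancy can be sketched as follows. Everything here is hypothetical: the two sources, their field names, and the `FIELD_MAPS` dictionary that plays the role of metadata.

```python
# Hypothetical records from two sources that describe the same customers
crm_rows = [
    {"cust_id": 1, "full_name": "Ana Diaz"},
    {"cust_id": 2, "full_name": "Bo Chen"},
]
billing_rows = [
    {"customer": 2, "name": "Bo Chen"},
    {"customer": 3, "name": "Cy Okoro"},
]

# Metadata: how each source's field names map onto one shared schema
FIELD_MAPS = {
    "crm": {"cust_id": "id", "full_name": "name"},
    "billing": {"customer": "id", "name": "name"},
}


def to_shared_schema(row, field_map):
    """Rename the fields of one row according to the source's field map."""
    return {field_map[key]: value for key, value in row.items()}


def integrate(sources):
    """Merge all sources into one list, keeping one record per id."""
    merged = {}
    for source_name, rows in sources.items():
        for row in rows:
            record = to_shared_schema(row, FIELD_MAPS[source_name])
            # Redundant copies of the same id are dropped; the first one wins
            merged.setdefault(record["id"], record)
    return [merged[key] for key in sorted(merged)]


customers = integrate({"crm": crm_rows, "billing": billing_rows})
print(customers)
```

Keeping the first copy of a duplicate is a simplification. A real integration step would usually also check that the redundant copies agree with each other.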

c. Data Selection

This is the process by which data relevant to the analysis is retrieved from the database. Analysis often draws on large volumes of historical data, but the repository of integrated data usually contains much more data than is actually required. From the available data, the data of interest needs to be selected and stored.

d. Data Transformation

In this process, we have to transform and consolidate the data into forms that are suitable for mining. Normally, this process includes normalization, aggregation, generalization, etc.

For example, a data set available as “-5, 37, 100, 89, 78” can be transformed into “-0.05, 0.37, 1.00, 0.89, 0.78” by dividing each value by 100. Here, data becomes more suitable for data mining. After data transformation, the available data is ready for data mining.
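The article does not name the normalization used in this example. The numbers are consistent with dividing every value by the largest absolute value in the set (100), so the sketch below assumes that rule. The function name `scale_by_max_abs` is hypothetical.

```python
def scale_by_max_abs(values):
    """Scale values into the range [-1, 1] by dividing by the largest absolute value."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        # All values are zero, so there is nothing to scale
        return [0.0 for _ in values]
    return [v / largest for v in values]


scaled = scale_by_max_abs([-5, 37, 100, 89, 78])
print(scaled)  # [-0.05, 0.37, 1.0, 0.89, 0.78]
```

Other normalization schemes, such as min-max or z-score normalization, would give different numbers for the same input.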

e. Data Mining

In this process, we apply methods to extract patterns from the data. Data mining includes several tasks, such as classification, prediction, clustering, time series analysis, and so on.
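To make one of these tasks concrete, here is a minimal clustering sketch: a one-dimensional k-means that groups numbers around a set of starting centers. The function `kmeans_1d`, the starting centers, and the purchase totals are all hypothetical, and the code is a toy illustration rather than a production algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the given starting centers (basic k-means)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (a center with an empty cluster stays where it is)
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical purchase totals with two visible groups
purchase_totals = [12, 15, 14, 90, 95, 88]
centers, clusters = kmeans_1d(purchase_totals, centers=[0, 100])
print(centers)
print(clusters)
```

Here the pattern that gets extracted is the split into low and high spenders, which nobody labeled in advance.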

f. Pattern Evaluation

Pattern evaluation identifies the truly interesting patterns that represent knowledge, based on different types of interestingness measures. A pattern is considered interesting if it is potentially useful, easily understandable, and valid on new data with some degree of certainty. A pattern is also interesting if it validates a hypothesis that someone wants to confirm.
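The article does not say which measures are used. Support and confidence are two widely used ones for association rules, so the sketch below assumes them. The transactions and both thresholds are hypothetical.

```python
# Hypothetical shopping baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies
    return support(antecedent | consequent, transactions) / base


# Evaluate the rule "bread -> milk"
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)

# Assumed thresholds; in practice they are chosen per project
MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.6
is_interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, is_interesting)
```

In this toy data, bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk, so the rule passes both assumed thresholds.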

g. Knowledge Representation

Knowledge representation is the means of presenting the mined knowledge to the user in an appealing way. This can also include information that's mined from the data. To generate the output, different techniques need to be applied.

3. Cross-Industry Standard Process For Data Mining (CRISP-DM)

The Cross-Industry Standard Process consists of six phases that occur in a cyclical process.

[Figure: Cross-Industry Standard Process for Data Mining]

a. Business Understanding

  • First, we have to understand the business objectives and find out what the business requirements are.
  • Next, we need to assess the current situation by evaluating the available resources, assumptions, and other important factors.
  • Then, we need to define data mining goals that help achieve the business objectives.
  • Finally, we have to establish a data mining plan to achieve both the business and data mining goals. The plan should be as detailed as possible.

b. Data Understanding

  • First, this phase starts with the collection of data. To perform data collection, there are activities that need to be performed, such as data loading and data integration.
  • Next, the “gross” or “surface” properties of the acquired data need to be examined and reported.
  • Then, we need to explore the data by tackling the data mining questions, which can be addressed using querying, reporting, and visualization.
  • Finally, we have to examine the data quality by answering some important questions, such as:
    • “Is the acquired data complete?”
    • “Are there any missing values in the acquired data?”
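The two data quality questions above can be answered with a simple per-column count of missing entries. This is a minimal sketch: the function `missing_value_report`, the sample rows, and the choice to treat `None`, empty strings, and absent keys as missing are all assumptions.

```python
def missing_value_report(rows, columns):
    """Count missing entries (None, empty string, or absent key) per column."""
    report = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if row.get(column) in (None, ""):
                report[column] += 1
    return report


# Hypothetical acquired data
rows = [
    {"id": 1, "age": 34, "city": "Pune"},
    {"id": 2, "age": None, "city": ""},
    {"id": 3, "city": "Delhi"},
]

report = missing_value_report(rows, ["id", "age", "city"])
is_complete = all(count == 0 for count in report.values())
print(report, is_complete)
```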

c. Data Preparation

  • Data preparation can take up to 90% of the project's time. Once we identify the data sources, we need to select, clean, construct, and format the data. The outcome of this step is the final data set.

d. Modeling

  • First, we have to select the modeling techniques to use on the prepared dataset.
  • Next, we have to generate a test scenario to check the quality and validity of the model.
  • Then, by using modeling tools, we have to build one or more models on the dataset.
  • Finally, these models need to be assessed by the project's stakeholders to make sure that they meet the business initiatives.
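One common way to set up a test scenario is to hold back part of the prepared data and measure the model on it. The sketch below assumes that approach. The toy threshold model, the function names, and the generated data are hypothetical and only stand in for a real modeling technique.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and hold back a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_threshold(train):
    """Toy model: the midpoint between the mean of each class."""
    positives = [x for x, label in train if label == 1]
    negatives = [x for x, label in train if label == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def accuracy(threshold, test):
    """Fraction of test rows where 'x above threshold' matches the label."""
    correct = sum(1 for x, label in test if (x > threshold) == bool(label))
    return correct / len(test)


# Hypothetical prepared dataset: (feature, label) pairs with two separated classes
data = [(x, 0) for x in range(1, 11)] + [(x, 1) for x in range(21, 31)]

train, test = train_test_split(data)
threshold = fit_threshold(train)
score = accuracy(threshold, test)
print(len(train), len(test), score)
```

Because the model never sees the held-back rows during fitting, the score gives stakeholders a fairer picture of how the model behaves on new data.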

e. Evaluation

  • In this phase, we have to evaluate the results in the context of the business goals.
  • New business requirements can come up here because of the new patterns discovered during the evaluation. Gaining business insights is an iterative process in data mining. The go or no-go decision must be made in this step before the project moves on to the deployment phase.

f. Deployment

  • We need to present the information we gained through the data mining process. The information has to be represented in such a way that stakeholders can use it whenever they want. Based on the business requirements, the deployment phase could be as simple as creating a report or as complex as setting up a repeatable data mining process across the organization. The deployment plan also has to include a maintenance plan.

  • The final report needs to summarize the project insights and outcomes and review the project to see what needs to be improved upon.

  • CRISP-DM offers a uniform framework for creating documentation and guidelines. In addition, CRISP-DM can be applied to various industries with different types of data.

Published at DZone with permission of Shailna Patidar. See the original article here.
