Data Mining Process: Cross-Industry Standard Process for Data Mining
In this article, we provide a high-level overview of the data mining process, discussing topics such as data cleaning, pattern evaluation, and more.
1. Introduction to Data Mining
Data mining is the process of discovering hidden, valuable knowledge by analyzing a large amount of data. That data is typically stored across different databases.
Because data mining benefits many industries, such as manufacturing and marketing, there is a need for a standard data mining process. This process must be reliable, and it should be repeatable by business people with little to no knowledge of data science.
2. Stages of Data Mining Process
The data mining process is divided into two stages: data preparation (also called data preprocessing) and data mining.
Figure: Stages of the data mining process
The data preparation stage includes data cleaning, data integration, data selection, and data transformation. The data mining stage includes data mining, pattern evaluation, and knowledge representation.
a. Data Cleaning
Data has to be cleaned because real-world data is noisy, inconsistent, and incomplete.
Data cleaning includes a number of techniques, such as filling in missing values and combined computer and human inspection.
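As a minimal sketch of one cleaning technique named above, filling in missing values, the snippet below replaces missing entries with the mean of the observed ones. The function name `fill_missing_with_mean`, the use of `None` as the missing-value marker, and the sample data are assumptions made for illustration; mean imputation is only one of several possible strategies.

```python
def fill_missing_with_mean(values):
    """Return a copy of values where None entries are replaced by the mean."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot fill missing values: nothing was observed")
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


# Made-up sample: two ages are missing.
ages = [25, None, 35, 30, None]
cleaned = fill_missing_with_mean(ages)
print(cleaned)
```

Whether the mean is a sensible replacement depends on the attribute; for skewed data a median or a domain-specific default may be a better fit.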
b. Data Integration
In this process, data is integrated from different data sources, because data exists in different formats and in different locations. It can be stored in databases, text files, spreadsheets, documents, data cubes, and so on. Data integration is complex, however, because data from different sources normally doesn’t match.
We use metadata to reduce errors in the data integration process. Another issue is data redundancy: the same data might be available in different tables in the same database. Data integration tries to reduce redundancy as much as possible without affecting the reliability of the data.
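The following sketch illustrates the two ideas above under simplifying assumptions: metadata (here, a plain field-name mapping per source) is used to bring records into a common schema, and redundant records are dropped by key. The source names, field names, and the "first record wins" rule are all hypothetical.

```python
# Hypothetical metadata: how each source's field names map to a common schema.
CRM_FIELDS = {"cust_id": "customer_id", "full_name": "name"}
BILLING_FIELDS = {"CustomerID": "customer_id", "CustomerName": "name"}


def to_common_schema(record, field_map):
    """Rename the fields of one record according to the source's metadata."""
    return {field_map[key]: value for key, value in record.items() if key in field_map}


def integrate(sources):
    """Merge (records, field_map) pairs, keeping the first record per customer_id."""
    merged = {}
    for records, field_map in sources:
        for record in records:
            row = to_common_schema(record, field_map)
            merged.setdefault(row["customer_id"], row)
    return list(merged.values())


crm = [{"cust_id": 1, "full_name": "Ada"}, {"cust_id": 2, "full_name": "Grace"}]
billing = [{"CustomerID": 2, "CustomerName": "Grace"}, {"CustomerID": 3, "CustomerName": "Alan"}]
customers = integrate([(crm, CRM_FIELDS), (billing, BILLING_FIELDS)])
print(customers)
```

Customer 2 appears in both sources but only once in the result. A real integration step would also have to decide what to do when the duplicated records disagree.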
c. Data Selection
This is the process by which data relevant to the analysis is retrieved from the database. The analysis usually draws on large volumes of historical data, and the repository of integrated data typically contains much more data than is actually required. From the available data, the data of interest needs to be selected and stored.
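A small sketch of data selection, assuming the integrated data is available as a list of dictionaries: only the rows and attributes of interest are kept. The `select` helper and the sales records are made up for illustration; in practice this step is often a database query.

```python
def select(records, columns, predicate):
    """Keep only the rows matching predicate and only the given columns."""
    return [{c: r[c] for c in columns} for r in records if predicate(r)]


# Made-up integrated data with more attributes than the analysis needs.
sales = [
    {"year": 2017, "region": "EU", "amount": 120.0, "clerk": "a"},
    {"year": 2018, "region": "US", "amount": 80.0, "clerk": "b"},
    {"year": 2018, "region": "EU", "amount": 95.5, "clerk": "c"},
]

eu_sales = select(sales, ["year", "amount"], lambda r: r["region"] == "EU")
print(eu_sales)
```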
d. Data Transformation
In this process, we transform and consolidate the data into forms that are suitable for mining. Normally this process includes normalization, aggregation, generalization, etc.
For example, a data set available as “-5, 37, 100, 89, 78” can be transformed into “-0.05, 0.37, 1.00, 0.89, 0.78” by dividing every value by 100. Here, the data becomes more suitable for data mining. After data transformation, the available data is ready for data mining.
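The example above is consistent with normalization by decimal scaling, where every value is divided by a power of ten. The sketch below picks the smallest power of ten that brings all absolute values to at most 1, which is an assumption chosen to reproduce the article's numbers; some definitions require values strictly below 1, which would divide this data set by 1000 instead. The function name `decimal_scale` is hypothetical.

```python
def decimal_scale(values):
    """Divide all values by the smallest power of ten that maps them into [-1, 1]."""
    largest = max(abs(v) for v in values)
    divisor = 1
    while largest > divisor:
        divisor *= 10
    return [v / divisor for v in values]


scaled = decimal_scale([-5, 37, 100, 89, 78])
print(scaled)  # [-0.05, 0.37, 1.0, 0.89, 0.78]
```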
e. Data Mining
In this process, we apply methods to extract patterns from the data. Data mining includes several tasks, such as classification, prediction, clustering, time series analysis, and so on.
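To make one of these tasks concrete, here is a minimal sketch of clustering using a one-dimensional k-means loop. The article does not prescribe an algorithm, so k-means, the function name `k_means_1d`, the hand-picked starting centroids, and the sample points are all assumptions.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Cluster numbers around centroids; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
print(centroids, clusters)
```

The two groups found here are the low values and the high values. With real data, the result depends on the number of clusters and on the starting centroids.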
f. Pattern Evaluation
Pattern evaluation identifies the truly interesting patterns that represent knowledge, based on different interestingness measures. A pattern is considered interesting if it is potentially useful, easily understandable, and valid on new data with some degree of certainty. A pattern is also interesting if it validates a hypothesis that someone wants to confirm.
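The article does not name specific measures. As an illustration, the sketch below computes support and confidence, two commonly used measures for association rules of the form "if A then B". The transactions and the rule are made up.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction also containing consequent."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


# Made-up shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here the rule "bread implies milk" holds in 2 of 4 baskets (support 0.5) and in 2 of the 3 baskets that contain bread (confidence about 0.67). A pattern would typically be kept only if both values exceed thresholds chosen for the project.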
g. Knowledge Representation
Knowledge representation is the means by which data is presented to the user in an appealing way. This can also include information that's mined from the data. To generate the output, different techniques need to be applied.
3. Cross-Industry Standard Process For Data Mining (CRISP-DM)
The Cross-Industry Standard Process consists of six phases that occur in a cyclical process.
Figure: Data mining process, Cross-Industry Standard Process (CRISP-DM)
a. Business Understanding
- First, we have to understand the business objectives and find out what the business requirements are.
- Next, we need to evaluate the available resources and assumptions, along with other important factors.
- Then, we need to define data mining goals that help achieve the business objectives.
- Finally, we have to establish a data mining plan to achieve both the business and the data mining goals. The plan should be as detailed as possible.
b. Data Understanding
- First, this phase starts with the collection of data. To perform data collection, activities such as data loading and data integration need to be carried out.
- Next, the “gross” or “surface” properties of the acquired data need to be examined and reported.
- Then, we need to explore the data by tackling the data mining questions, which can be addressed using querying, reporting, and visualization.
- Finally, we have to examine the data quality by answering some important questions, such as:
- “Is the acquired data complete?”
- “Are there any missing values in the acquired data?”
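The data quality questions above can be checked mechanically. The sketch below counts missing values per field, assuming records are dictionaries and that an absent key, `None`, or an empty string all count as missing. The function name `missing_value_report` and the sample records are hypothetical.

```python
def missing_value_report(records, fields):
    """Count, per field, how many records lack a usable value."""
    report = {}
    for field in fields:
        report[field] = sum(1 for r in records if r.get(field) in (None, ""))
    return report


# Made-up acquired data with gaps.
records = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "", "age": None},
    {"id": 3, "age": 41},
]

report = missing_value_report(records, ["id", "email", "age"])
print(report)
is_complete = all(count == 0 for count in report.values())
print(is_complete)
```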
c. Data Preparation
The data preparation phase can take up to 90% of the project's time. The outcome of this phase is the final data set. Once we identify the data sources, we need to select, clean, construct, and format the data.
d. Modeling
- First, we have to select the modeling techniques to use on the prepared dataset.
- Next, we have to generate a test scenario to validate the quality and validity of the model.
- Then, using modeling tools, we have to build one or more models on the dataset.
- Finally, these models need to be assessed by the project's stakeholders to make sure that they meet the business objectives.
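The steps above (choose a technique, set up a test, build a model, assess it) can be sketched as follows. This is only an illustration under stated assumptions: the "model" is a trivial threshold classifier on one numeric feature, the test scenario is a simple holdout split, and the data, the 30% test ratio, and all function names are made up.

```python
import random


def train_test_split(rows, test_ratio=0.3, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Model: the midpoint between the mean feature value of each class."""
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    """Share of rows whose label matches the prediction x > threshold."""
    hits = sum(1 for x, label in rows if (x > threshold) == (label == 1))
    return hits / len(rows)


# Made-up, well-separated data: (feature, label) pairs.
rows = [(x, 0) for x in range(10)] + [(x, 1) for x in range(90, 100)]

train, test = train_test_split(rows)
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
print(len(train), len(test), test_accuracy)
```

Evaluating on rows the model never saw during fitting is what makes the test scenario meaningful. The accuracy is perfect here only because the two classes are far apart.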
e. Evaluation
- In this phase, we have to evaluate the model results in the context of the business goals.
- New business requirements can come up, due to the new patterns discovered during evaluation. Gaining business insight is an iterative process in data mining. The go or no-go decision must be made in this phase before the project moves on to the deployment phase.
f. Deployment
We need to present the information we gained through the data mining process. The information has to be represented in such a way that stakeholders can use it whenever they want. Based on the business requirements, the deployment phase could be as simple as creating a report or as complex as setting up a repeatable data mining process across the organization. The deployment plan should also include a maintenance plan.
The final report needs to summarize the project insights and outcomes and review the project to see what needs to be improved upon.
CRISP-DM offers a uniform framework for creating documentation and guidelines. In addition, it can be applied to various industries with different types of data.
Published at DZone with permission of Shailna Patidar. See the original article here.