Successful Data Science Project Planning: CRISP-DM Is Not Dead!
In this post, we explore the steps all data science projects must go through in order to find interesting insights.
It is difficult to find a concise article that offers a comprehensive guide to implementing a Machine Learning or Data Science project. Many online articles provide detailed information on how to implement individual parts of such a project, but sometimes companies only need the high-level steps that give a clear overview.
A lot of Data Science project leads today forget about CRISP-DM, the Cross-Industry Standard Process for Data Mining, created in 1996. In 2015, IBM released a new methodology called the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (also known as ASUM-DM), which refines and extends CRISP-DM.
Machine Learning projects are used to discover patterns in data that drive a business. As with any other project, good planning is required to carry out a Data Science project successfully. Below, we describe six stages that will help structure Data Science projects and lead them to an effective end.
Data Science Project Planning
1. Understanding Business
At this stage, we focus on understanding project goals and requirements from a business perspective, and then transforming this knowledge into a definition of the data science problem.
It is important that business leaders and their project managers start to spend time clearly defining specific problems or challenges they would like to solve with the help of Data Science. The more specific the goal is, the greater the chance of successful implementation of machine learning algorithms.
For example, saying that an organization would like to "increase online sales by 25%" is not detailed enough. Instead, a more defined statement, such as "striving to increase online sales by 25% by monitoring the demographic and activity data of site visitors and predicting their actions," is much more useful in determining the goal and ensuring that it is understandable to all interested parties.
2. Understanding the Data
The goal of this stage is to collect data and assess its suitability. It begins with initial data collection and continues with identifying data quality problems, discovering first insights into the data, and finding interesting subsets that help formulate hypotheses about hidden information.
Much time and effort is devoted to data collection, so organizations need to ensure that relevant data is collected in sufficient quantity. It is not enough to have data points such as age, gender, or ethnicity only; it is much more important to have data on customer behavior, purchase trends, and activity. It is worth remembering that for a successful outcome, data quality is just as important as the size of the data set, so organizations should prioritize data management procedures. Here, a Data Warehouse or Data Lake implementation can be useful, since data will be collected, transformed, and prepared there for further advanced analytics.
Pandas, a high-performance Python data analysis library, can be useful at this stage. With its help, you can explore the data and understand its main statistics.
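As a sketch of this first pass, pandas can summarize column types, missing values, and basic statistics. The small visitor data set below is hypothetical; in practice it would be loaded from a CSV file or a database:

```python
import pandas as pd

# Hypothetical site-visitor data; real data would come from pd.read_csv(...)
# or a database query.
df = pd.DataFrame({
    "age": [25, 41, None, 35, 29],
    "gender": ["F", "M", "M", None, "F"],
    "purchases": [3, 0, 1, 7, 2],
})

df.info()               # column types and non-null counts
print(df.describe())    # main statistics for numeric columns
print(df.isna().sum())  # missing values per column -- a quick data quality check
```

Even these three calls surface the kinds of data quality problems this stage is meant to identify, such as missing values and implausible ranges.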
3. Data Preparation
It may be tempting for a company to jump straight into the modeling exercise. However, it is important that a firm first conducts a quick data-mining exercise to verify assumptions and understand the data. This helps determine whether the data tells the right story, based on the organization's subject-matter knowledge and business awareness.
Such an exercise will also help the organization understand which variables are (or may be) significant features, and what kind of data categorization should be created so they can be used as inputs to any potential models.
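As a hypothetical sketch of such a categorization, a continuous variable like age can be binned into labeled groups with pandas before being used as a model input (the bin boundaries and labels here are illustrative, not prescriptive):

```python
import pandas as pd

ages = pd.Series([17, 25, 34, 52, 68])

# Bin a continuous variable into labeled categories for use as a model input.
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 120],
    labels=["minor", "young", "middle", "senior"],
)
print(age_group.tolist())
```

Whether such bins make sense is exactly the kind of question the business experts mentioned below should weigh in on.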
The data preparation phase includes all activities aimed at building the final data set for modeling stage from the initial raw data.
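A minimal sketch of this phase with pandas (the column names and imputation choices are hypothetical): fill gaps in the raw data and encode categorical variables so the final data set is ready for modeling:

```python
import pandas as pd

# Hypothetical raw data with the typical gaps found in practice.
raw = pd.DataFrame({
    "age": [25, 41, None, 35],
    "gender": ["F", "M", "M", None],
    "purchases": [3, 0, 1, 7],
})

prepared = raw.copy()
# Impute numeric gaps with the median, categorical gaps with a sentinel value.
prepared["age"] = prepared["age"].fillna(prepared["age"].median())
prepared["gender"] = prepared["gender"].fillna("unknown")
# One-hot encode the categorical column so models can consume it.
prepared = pd.get_dummies(prepared, columns=["gender"])
print(prepared.head())
```

Whatever steps are chosen here should be recorded, because, as discussed in the deployment stage below, the same transformations must be applied to new raw data in production.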
4. Modeling
Statistical models are built, selected, and checked during this stage. Since some techniques, such as neural networks, have specific data requirements, you may need to go back to the data preparation stage.
Business experts should be involved here because their continuous feedback is crucial for validation and for ensuring that all stakeholders are on the same page. Indeed, since the success of any ML model depends on successful feature engineering, the expert will always be more valuable than the algorithm when it comes to obtaining better features.
Once you have built one or more high-quality models based on your chosen features, test them to make sure they generalize well and that all key business issues have been sufficiently addressed. The end result is a selection of the most relevant model(s).
5. Evaluation
Defining the model's performance measures will help in evaluating, comparing, and analyzing the results of many algorithms, which in turn helps to improve specific models. Classification accuracy, for example (the number of correct predictions divided by the total number of predictions made, multiplied by 100), would be a good performance measure for a classification model.
The data will have to be divided into two sets: a training set on which the algorithm is trained, and a test set on which it is evaluated. Depending on the complexity of the algorithm, this can be as simple as a random split, e.g. 60% for training and 40% for testing, or it may involve more complicated sampling or cross-validation procedures. There are many approaches and methods for evaluating your model; you can find useful evaluation steps in this article.
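The 60/40 split and the accuracy measure described above can be sketched with scikit-learn (the data here is synthetic, and logistic regression stands in for whatever model the project actually uses):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic classification data in place of real project data.
X, y = make_classification(n_samples=500, random_state=0)

# Simple random division: 60% for training, 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test)) * 100  # percent correct
print(f"accuracy: {accuracy:.1f}%")
```

For the more involved procedures mentioned above, scikit-learn's `cross_val_score` evaluates the same model across several folds instead of a single split.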
As with hypothesis testing, business and domain experts should be involved in validating the results and making sure everything is going in the right direction.
6. Implementation and Deployment
After the model has been built and approved, it must be put into production. Beginning with a limited rollout for several weeks or months, during which business users can provide continuous feedback on the model's behavior and results, it can then be extended to a wider audience.
Adequate tools and platforms should be selected to automate data collection, with systems put in place to disseminate results to relevant audiences. The platform should provide multiple interfaces to account for different levels of knowledge among end users of the organization. Business analysts may want to conduct further analysis, for example on the basis of model results, whereas ordinary end-users may simply want to interact with data using dashboards and visualizations.
Essentially, this means implementing the model into the operational systems so it can score or categorize new/unknown data as it arrives, and creating a mechanism to use this new information to solve the original business problem. Importantly, the deployed code must also include all the data preparation steps leading up to modeling, so that the model treats new raw data in the same way as during model development.
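One common way to satisfy this requirement (a sketch using scikit-learn; the steps and model are illustrative) is to bundle the preparation steps and the model into a single `Pipeline` object, so that new raw data passes through exactly the same transformations at prediction time as during development:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the project's prepared training set.
X, y = make_classification(n_samples=200, random_state=0)

# Preparation and modeling travel together: imputation, scaling, then the model.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# New/unknown raw data goes through the identical preparation steps.
predictions = pipeline.predict(X[:5])
print(predictions)
```

The fitted pipeline can then be serialized as one artifact (e.g. with joblib), so deployment cannot accidentally drop a preparation step.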
There are dedicated tools that will help you deploy your Data Science solution to a production environment. You could wrap Flask around your machine learning models to serve them as a REST API; this article extends that by explaining how to productionize the Flask API and get it ready for deployment using Docker.
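A minimal sketch of such a Flask wrapper is shown below. The endpoint name, payload format, and the toy stand-in for a trained model are all hypothetical; in a real service the model would be loaded once at startup, e.g. with `joblib.load(...)`:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_one(features):
    """Toy stand-in for a real model's predict() call."""
    return 1 if sum(features) > 0 else 0


@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [1.0, 2.0, ...]}.
    payload = request.get_json()
    prediction = predict_one(payload["features"])
    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Callers would then POST feature vectors to `/predict` and receive predictions as JSON, which is the interface a Docker container would expose in production.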