The 10 Commandments for Performing a Data Science Project
To truly deliver against a well-established brief, here are 10 guiding principles for performing a data science project.
In designing a data science project, establishing what we, or the users we are building models for, want to achieve is vital, but this understanding only provides a blueprint for success. To truly deliver against a well-established brief, data science teams must follow best practices in executing the project. To help establish what that might mean, I have come up with ten points to provide a framework that can be applied to any data science project.
1. Understand the Problem
The most fundamental part of solving any problem is knowing exactly what problem you are solving. Make sure you understand what you are trying to predict, any constraints, and what the ultimate purpose of the project is. Ask questions early on and validate your understanding with peers, domain experts, and end-users. If the answers align with your understanding, you know you are on the right path.
2. Know Your Data
By knowing what your data means, you’ll be able to understand which kinds of models work well and which features to use. The nature of the problem behind the data affects which model will be most successful, and the computational time required influences the cost of the project. By using and creating meaningful features, you can mimic or improve upon human decision-making. Understanding what each field means is important to the problem, especially in regulated industries, where data may be anonymized and thus not self-explanatory. Check with a domain expert if you are unclear on what something means.
3. Split Your Data
How will your model perform on unseen data? It doesn’t matter how well it performs on the data you are given if it can’t generalize to new data. By withholding part of the data from the model during training, you can validate how well it will perform on unknown data. This step is crucial to choosing the correct model architecture and tuning parameters to get the best performance.
For supervised learning problems, you will need to split your data into two or three parts. The training data – the data that the model learns from – typically is 75-80% of the original data, chosen at random. The testing data – the data by which you evaluate your model – is the remaining data. Depending on the type of model you are building, you may also need a third hold-out set called the validation set, which is used to compare multiple supervised learning models that have been tuned on the test data. In this case, you will need to split the non-training data into two data sets, the testing and the validation. You want to compare iterations of the same model using the test data and compare the final versions of different models using the validation data.
In Python, the easiest way to split your data correctly is by using Scikit-learn’s train_test_split function.
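As a sketch of the three-way split described above (the toy arrays and the 80/10/10 proportions are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for real data
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First carve off 80% for training
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.8, random_state=42)

# Split the remainder evenly into test and validation sets
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_test), len(X_val))  # 40 5 5
```

Since train_test_split only splits two ways, calling it twice is the usual way to get a train/test/validation split.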
4. Don't Leak Test Data
It is important not to feed any information from the test data into your model. This can be as obvious as training on your whole data set or as subtle as performing transformations – such as scaling – before splitting. For example, if you normalize your data prior to splitting, the model is gaining information about the test data set since the global minimum or maximum may be in the held-out data.
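A minimal sketch of the right (and wrong) order of operations, using Scikit-learn’s MinMaxScaler on random stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # stand-in feature matrix

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Correct: fit the scaler on the training data only, then apply
# the same learned transformation to the test data.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Incorrect (leakage): fitting on the full data would let the test
# set's min/max influence how the training data is transformed.
# leaky_scaler = MinMaxScaler().fit(X)
```

Scikit-learn’s Pipeline class automates this ordering when combined with cross-validation, which makes this kind of leakage harder to introduce by accident.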
5. Use the Correct Evaluation Metrics
Since every problem is different, the appropriate method of evaluation must be chosen based on the context. The most naïve – and perhaps dangerous – classification metric is accuracy. Consider the problem of detecting cancer. If we want a pretty accurate model, we should just always predict “not cancer” since more than 99 percent of the time we’ll be right. However, this isn’t a very helpful model since we actually want to detect cancer. Take care in considering which evaluation metric to use in your classification and regression problems.
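The cancer example above can be made concrete with Scikit-learn’s metrics (the 99-to-1 class balance is illustrative):

```python
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced toy labels: 1 = cancer (rare), 0 = not cancer
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # a "model" that always predicts "not cancer"

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero cancer cases
```

A metric such as recall (or F1, or area under the precision-recall curve) exposes what accuracy hides on imbalanced problems.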
6. Keep It Simple
When approaching a problem, it’s important to choose the right solution for the job, not the most complicated model. Management, customers, and even you might want to use the “latest-and-greatest.” You need to use the simplest model that meets your needs, a principle called Occam’s Razor. Not only will this provide more visibility and shorten training times, but it can actually improve performance. In short, don’t shoot a fly with a bazooka or try to kill Godzilla with a flyswatter.
7. Don't Overfit (or Underfit) Your Model
Overfitting, associated with high variance, leads to poor performance on data the model hasn’t seen: the model is simply memorizing the training data. Underfitting, associated with high bias, means giving the model too little information or capacity to learn a correct representation of the problem. Balancing these two – commonly referred to as the “bias-variance trade-off” – is an important part of the data science process, and different problems require a different balance.
Let’s take a simple image classifier as an example. Its task is to classify whether there is a dog in an image or not. If you overfit this model, it won’t be able to identify an image as a dog unless it has seen that exact image before. If you underfit the model, it might not recognize an image as a dog even if it has seen that particular image before.
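One way to see the trade-off in practice is to vary a capacity knob, such as a decision tree’s max_depth, and compare training and test scores (the data here is synthetic and the exact scores will vary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# depth 1 tends to underfit; an unlimited tree tends to overfit
for depth in (1, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, round(tree.score(X_tr, y_tr), 2), round(tree.score(X_te, y_te), 2))
```

The unlimited tree typically hits a perfect training score while its test score lags behind – the signature of memorization.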
8. Try Different Model Architectures
Most of the time, it is beneficial to consider different model architectures for a problem. What works best for one problem may not be great for another. Try a mix of simple and complicated algorithms. For example, when building a classifier, try something as simple as a random forest and as complex as a neural network. Interestingly, extreme gradient boosting (XGBoost) often far outperforms a neural network classifier. A simple problem is often best solved with a simple model.
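A sketch of trying several architectures on the same data with cross-validation; Scikit-learn’s GradientBoostingClassifier stands in here for XGBoost, and the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=1),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

# Mean 5-fold cross-validation accuracy for each candidate architecture
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```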
9. Tune Your Hyperparameters
Hyperparameters are configuration values, set before training, that control how a model learns. For example, one hyperparameter of a decision tree is the depth of the tree, i.e., how many questions it’ll ask before deciding on an answer. The default hyperparameters for a model are those that tend to give good performance on average, but it is highly unlikely that your problem sits right at that sweet spot; your model can perform a lot better if different values are selected. The most common methods for tuning hyperparameters are grid search, randomized search, and Bayesian-optimized search, but there are a number of other more advanced techniques.
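As an illustration, a grid search over two decision-tree hyperparameters with Scikit-learn’s GridSearchCV (the grid itself is a made-up example, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Hypothetical search space; real grids depend on the model and data
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]}

# Exhaustively evaluate every combination with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

RandomizedSearchCV follows the same interface and samples the space instead of exhausting it, which scales better to large grids.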
10. Compare Models Correctly
The ultimate goal of machine learning is to develop a model that generalizes well. That is why it is so important to compare and select the best model correctly. As mentioned above, you’ll want to evaluate on a different holdout set than the one with which you tuned your hyperparameters. Additionally, you’ll want to use appropriate statistical tests to evaluate the results.
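One common (if imperfect) approach is to score both models on the same cross-validation folds and run a paired t-test on the per-fold scores; this sketch assumes SciPy is available and uses a synthetic dataset:

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Paired t-test over per-fold scores; a small p-value suggests the
# difference is unlikely to be noise
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"mean A={scores_a.mean():.3f}, mean B={scores_b.mean():.3f}, p={p_value:.3f}")
```

One caveat: cross-validation folds overlap in their training data, which violates the test’s independence assumption, so treat the p-value as a rough guide rather than a definitive verdict.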
Now that you have guiding principles for performing a data science project, try them out on your next data science project. I’d be interested to know if they helped you, so let me know if they did, or if they didn’t. Please add any of your own commandments in the comments below!