The 10 Commandments for Designing a Data Science Project
Data science projects can deliver incredible value for organizations, but they must be designed in line with core guiding principles to ensure maximum returns.
As businesses across industries seek to improve workflows and the delivery of products and services through increased automation, there is an ever-growing demand for the adoption of more advanced data science capabilities and projects.
Artificial intelligence and machine learning can, of course, deliver great ROI — but only under the right conditions. In every instance, a data science project must be framed in the right way, both from a business and a technical point of view. To help provide this framework, I have devised the following “10 commandments” for designing a data science project.
1. Define the Problem
When approaching a problem for which data science may hold an answer, it’s imperative that the problem is defined in the most complete terms. Set aside time at the beginning of the project for this phase. Document what the problem to be solved is, which data is available to you, and what kind of solution is desired. Iterate on the problem statement with the end user to ensure that the correct solution is delivered.
Getting specific when defining the problem is key. Take the example of a fraud detection model. Defining the problem as “reduce fraud” sets wide parameters and has no distinct start or end point. Being more specific will guide you in solving the problem correctly and efficiently. For example, frame the problem as “flag potentially fraudulent transactions on credit card purchases before the payment goes through and alert the customer,” which specifies what needs to be predicted, which actions need to be taken, and the appropriate time frame.
2. Don't Create a Problem Based on the Solution You Want
This follows from the first commandment. It’s dangerous to say, “I want to solve this problem using a neural network,” or even, “we’ll solve this using machine learning” without understanding the data and the problem statement. Not all problems need machine learning; rule-based approaches are often sufficient and sometimes superior. Similarly, not all machine learning problems are suited to a neural network. There are many algorithms, and each is good at different things. Let the solution come from the problem, not the other way around.
This, again, comes down to careful definition. Don’t jump to the solution. For example, avoid defining your solution as “I want to use deep learning to flag potentially fraudulent transactions.” Instead, frame your solution in simpler terms, such as “I want to flag potentially fraudulent transactions.”
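To illustrate the point that a rule-based approach can be a reasonable starting point, here is a minimal sketch of a rule-based flagger for the fraud example. The `Transaction` fields, the rule names, and both thresholds are hypothetical and chosen only for illustration; real rules would come from the problem definition and a subject matter expert.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float             # purchase amount
    country: str              # country where the purchase was made
    home_country: str         # cardholder's home country
    purchases_last_hour: int  # recent purchases on the same card


# Hypothetical thresholds, for illustration only.
MAX_AMOUNT = 5_000.0
MAX_PURCHASES_PER_HOUR = 10


def flag_transaction(tx: Transaction) -> list[str]:
    """Return the names of the rules a transaction trips (empty if none)."""
    reasons = []
    if tx.amount > MAX_AMOUNT:
        reasons.append("large_amount")
    if tx.country != tx.home_country:
        reasons.append("foreign_country")
    if tx.purchases_last_hour > MAX_PURCHASES_PER_HOUR:
        reasons.append("high_velocity")
    return reasons


# A small domestic purchase trips no rules; a large foreign one trips two.
ordinary = Transaction(amount=20.0, country="US", home_country="US", purchases_last_hour=1)
suspicious = Transaction(amount=9_000.0, country="FR", home_country="US", purchases_last_hour=2)
print(flag_transaction(ordinary))    # []
print(flag_transaction(suspicious))  # ['large_amount', 'foreign_country']
```

A baseline like this also gives any later machine learning model something concrete to beat.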
3. Ensure That the Problem Can Be Solved
Understand that defining a problem and obtaining data does not mean that the problem can be solved. Think about current solutions, what kind of data you have, and the desired result. Could a human solve this problem using the same data given infinite time? If not, it’s likely that the problem can’t be solved using machine learning. When in doubt, consult a colleague.
In the financial world, account balance prediction is an oft-requested solution, but no person or computer could tell you what your finances will be like over the coming months. Think of when the pandemic hit; millions of people unexpectedly lost their jobs. What about when there’s a house burglary and items need to be replaced? These are things that neither a human nor an algorithm can predict.
4. Understand the Target User
The ultimate goal of any data science project is to satisfy the needs of the end user by providing an appropriate solution that reduces their workload. By knowing what the end user currently has and what they lack, you can aim towards the best solution from the get-go. Does the user want an aggregate prediction, a distribution, or individual predictions? How do they want the data presented? An API might be more appropriate for a technical user, while a visual dashboard might suit a manager better. Settling these questions early can reduce tedious reformatting once the solution is finished, so they must be considered ahead of time.
5. Have Good Data That Relates to the Problem
Garbage in, garbage out. That’s a very common adage among data scientists. No matter how much data exists, if it’s not good, you can’t proceed. The data has to relate to the problem and have a sufficient number of legitimate records.
If the task requires data labels and there are none, a classification algorithm can’t work. If the data has an inconsistent structure, then future pipelines can’t work. Don’t build a garbage model just for the sake of it.
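The two failure modes above (missing labels and inconsistent structure) can be checked before any modeling starts. The following is a rough sketch that assumes the records arrive as a list of dictionaries; the function name `check_records` and the field names in the usage example are hypothetical.

```python
def check_records(records, required_fields, label_field):
    """Count records with missing fields or missing labels.

    records: list of dicts, one per data record
    required_fields: field names every record is expected to contain
    label_field: name of the field holding the label
    """
    missing_fields = 0
    missing_labels = 0
    for record in records:
        # Inconsistent structure: an expected field is absent.
        if any(field not in record for field in required_fields):
            missing_fields += 1
        # No label: the field is absent or explicitly None.
        if record.get(label_field) is None:
            missing_labels += 1
    return {
        "total": len(records),
        "missing_fields": missing_fields,
        "missing_labels": missing_labels,
    }


sample = [
    {"amount": 10.0, "is_fraud": False},
    {"amount": 5.0},       # no label
    {"is_fraud": True},    # no amount
]
report = check_records(sample, required_fields=["amount"], label_field="is_fraud")
print(report)
```

If a large share of records fails either check, that is a signal to fix the data before building anything on top of it.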
6. Have a Subject Matter Expert Available
Without understanding the problem and the data, you will inevitably make very avoidable mistakes.
By having a subject matter expert available, you can ask questions about the data (e.g. “what does it mean when this field is null?”) and the problem (e.g. “does it make sense to consider this feature?”). By checking with them along the way, you can ensure that your product will, indeed, be a solution.
7. Be Aware of Your Compute and Time Limitations
Business objectives will almost always adhere to a timeline, so consider how much time and compute power are available for both training models and generating predictions. Some situations require near-instantaneous predictions, while others can be handled in batches at leisure. You might have large compute clusters available, or you might need to train models quickly using little memory. You don’t want to build a super complex neural network that has to train on a Raspberry Pi.
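One way to make a prediction-time limit concrete is to measure it early. The sketch below times an arbitrary `predict` callable against a latency budget using only the standard library. The function names, the stand-in model, and the 50-millisecond budget are assumptions made for illustration, not recommendations.

```python
import time


def mean_latency_seconds(predict, inputs, repeats=3):
    """Average wall-clock time per prediction over several passes."""
    start = time.perf_counter()
    for _ in range(repeats):
        for item in inputs:
            predict(item)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(inputs))


def within_budget(predict, inputs, budget_seconds):
    """True if the average prediction time fits inside the budget."""
    return mean_latency_seconds(predict, inputs) <= budget_seconds


def toy_model(x):
    # Stand-in for a real model's predict function.
    return x * 2


# Hypothetical budget of 50 ms per prediction.
print(within_budget(toy_model, inputs=[1, 2, 3], budget_seconds=0.05))
```

Measurements like this are only meaningful on hardware comparable to the deployment target.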
8. Know Upfront What Legal Restrictions Exist
In regulated industries, such as finance, there are limits on what information can be used and how transparent a model must be. Know ahead of time what data can be used freely. If a field you want is restricted, find out if it can be bucketed or anonymized in some way. Equally important is which machine learning models can be used for each task, without compromising regulatory standards. Decision trees are generally considered very transparent, for example, while neural networks are not. Slight performance decreases are often necessary to satisfy legal requirements.
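As a rough illustration of bucketing and anonymizing a restricted field, the sketch below coarsens an exact age into a range and replaces an identifier with a keyed hash. The bucket boundaries, function names, and key are hypothetical. Whether either transformation actually satisfies a given regulation is a legal question this example does not answer, and a keyed hash is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac

# Hypothetical age ranges: (lowest age, highest age, label).
AGE_BUCKETS = [
    (0, 17, "0-17"),
    (18, 34, "18-34"),
    (35, 49, "35-49"),
    (50, 64, "50-64"),
]


def bucket_age(age: int) -> str:
    """Replace an exact age with a coarse range."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high, label in AGE_BUCKETS:
        if low <= age <= high:
            return label
    return "65+"


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (hex string)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"example-key-do-not-use"  # placeholder; a real key would be stored securely
print(bucket_age(40))            # 35-49
print(pseudonymize("customer-123", key))
```

The same identifier always maps to the same digest under a given key, so records can still be joined without exposing the original value.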
9. Understand the Deployment Pipeline
Knowing how the solution will be deployed can help you streamline the development process. A few things to consider are data format, model storage location, timing, and maintenance. Is this a hosted model? Are there standard company practices you must follow? Being aware of these in the early stages of design will save a lot of time and work.
10. Don’t Reinvent the Wheel
Perhaps most importantly of all, don’t spend time replicating an existing solution. If a solution exists, use it. Use your time and compute power to iterate and improve on what’s available.
So, there you have it: ten commandments to set you up for your data science projects. Whether you’re part of a team within a large organization or a data science lone wolf, follow these core principles and you’ll never stray far from your target. Lastly, keep your eyes peeled for my next article, in which I’ll detail my ten commandments for performing a data science project.