# The Common Data Science Project Flow

While working on multiple data science projects, I have noticed that a group of strategic projects share a similar pattern, one to which a common methodology can be applied. In this post, I want to sketch this methodology at a high level.

First of all, "data science" is a very generic term that means different things to different people. Many of the projects I have been involved in aim to solve a very tactical and specific problem. However, over the last few years, more and more enterprises have started to realize the strategic value of data. I have observed a growing number of strategic data science projects that start from a very broad scope and take a top-down approach to the overall business operation. Along the way, the enterprise prioritizes the most critical areas within its operation cycle and builds sophisticated models to guide and automate the decision process.

Usually, my engagement starts as a data scientist / consultant with very little (or even no) domain knowledge. Being unfamiliar with the domain is nothing to be proud of, and it often slows down my initial discussions. Therefore, within a compressed time period, I need to quickly learn enough "basic domain knowledge" to keep the discussion running smoothly. On the other hand, lacking a preconceived model enables me (or, you could say, forces me) to look at the problem with fresh eyes, so I can trim off unnecessary legacy details and focus only on the essential elements that contribute to the core of the data model. It is also fun to go through the concept-blending process between a data scientist and a domain expert: I force them to think my way, and they force me to think their way. This is by far the most effective way for me to learn new concepts.

Recently I had a discussion with a company that has a small but very sophisticated data science team that builds pricing models and demand forecasts for its product line. I am by no means an expert in their domain, but their problem (how to predict demand and how to set prices) is general enough to apply across many industries. Therefore, I will use this problem as an example to illustrate the major steps in the common pattern described above.

### Problem Settings

Let's say a car manufacturer starts its quarterly planning process. Here are some key decisions that management needs to make.

- How many cars should the company produce next year?
- What should the new price of the cars be?

In this problem, the goal is to ...

maximize: "Profit_2015"

In general, I find it a good start to look at the problem from an "optimization" angle, in which we define our goal in terms of an objective function and a set of constraints.

### Step 1: Identify variables and define their dependency graph

Build the dependency graph between the variables, starting from the objective function. Separate the decision variables (which you control) from the environment variables (which you do not control).

As an illustrative example, we start from our objective function "Profit_2015" and define the dependency relationships below. The decision variables are UnitProduced_2015 and Price_2015, matching the two planning questions above.

Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)

UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)

Demand_2015 = H(GDP_2014, PhoneSold_2014)

GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

...

Identifying these variables and their potential dependencies typically draws on well-studied academic theory or on domain experts in the industry. At this stage, we don't need to know the exact formulas of the functions F/G/H; we only need to capture the links between the variables. It is also OK to include a link that shouldn't exist (i.e., there is no relationship between the two variables in reality). However, it is not good to miss a link (i.e., to fail to capture a strong, existing dependency).
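A minimal sketch of how these links could be recorded, assuming Python and its standard-library `graphlib` module. The variable names come from the example above; the helper name `evaluation_order` is only illustrative. Only links are stored, no formulas.

```python
from graphlib import TopologicalSorter

# Each variable maps to the set of variables it depends on.
# Only the links listed in the example are included.
DEPENDENCIES = {
    "Profit_2015": {"UnitSold_2015", "UnitProduced_2015", "Price_2015", "Cost_2015"},
    "UnitSold_2015": {"Supply_2015", "Demand_2015", "Price_2015", "CompetitorPrice_2015"},
    "Demand_2015": {"GDP_2014", "PhoneSold_2014"},
}


def evaluation_order(dependencies):
    """Return the variables so that each one appears after everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


order = evaluation_order(DEPENDENCIES)
print(order)  # the objective, Profit_2015, comes last
```

Sorting the graph this way also catches accidental cycles early, because `TopologicalSorter` raises `CycleError` when two variables depend on each other.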

This round usually involves 4 to 5 half-day brainstorming sessions with the domain experts, facilitated by a data scientist/consultant who is familiar with the model-building process. Additional investigation and background study may be needed if no subject matter experts are available. Starting from scratch, this round can take anywhere from a couple of weeks to a couple of months.

### Step 2: Define the dependency function

In this round, we want to pin down the relationships between the variables by specifying the formulas of F(), G(), and H().

**Well-Known Function**

For relationships that are well studied, we can use a known mathematical model.

For example, in the relationship

Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)

We can use the following mathematical formula in a very straightforward manner:

Profit = (UnitSold * Price) - (UnitProduced * Cost)
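As a small illustration, the formula translates directly into a function. The figures below are hypothetical and only show that unsold units still add to cost.

```python
def profit(unit_sold, price, unit_produced, cost):
    """Revenue from the units sold minus the cost of all units produced."""
    return unit_sold * price - unit_produced * cost


# Hypothetical figures: 900 of 1,000 cars sold at 30,000 each, each costing 20,000 to build.
example = profit(unit_sold=900, price=30_000, unit_produced=1_000, cost=20_000)
print(example)  # 7000000
```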

**Semi-Known Function**

However, some relationships are not as straightforward as that. For relationships whose exact formula we don't know, but whose shape we can reasonably assume, we can assume that the relationship follows a family of models (e.g., linear, quadratic, etc.) and then find the parameters that best fit the historical data.

For example, in the relationship

Demand_2015 = H(GDP_2014, PhoneSold_2014)

Let's assume that "demand" is a linear combination of "GDP" and "phone sold", which seems to be a reasonable assumption.

For the linear model we assume

Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)

Then we feed the historical training data into a linear regression model to find the best-fitting values of w0, w1, and w2.
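Here is a rough sketch of that fit using ordinary least squares via the normal equations, written with the Python standard library only. The history is made up (it is generated from known weights so the fit can be checked), and the function names are illustrative. In practice a numerical library would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear_demand(gdp, phone_sold, demand):
    """Least-squares fit of Demand = w0 + w1*GDP + w2*PhoneSold; returns [w0, w1, w2]."""
    rows = [[1.0, g, p] for g, p in zip(gdp, phone_sold)]
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, demand)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Made-up history generated from Demand = 10 + 2*GDP + 0.5*PhoneSold.
gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
phone_sold = [4.0, 1.0, 6.0, 2.0, 8.0]
demand = [10 + 2 * g + 0.5 * p for g, p in zip(gdp, phone_sold)]

w0, w1, w2 = fit_linear_demand(gdp, phone_sold, demand)
```

Because the synthetic data contains no noise, the fitted weights recover the generating values up to floating-point error. Real history would only give an approximate fit.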

**Time-Series Function**

In some cases, a variable depends only on its own past values and not on other variables. Here we can train a time series model to predict the variable from its own past values. Typically, the model is decomposed into three components: noise, trend, and seasonality. One popular approach is to use exponential smoothing techniques such as the Holt-Winters model. Another popular approach is the ARIMA model, which decomposes the value into "auto-regression" and "moving-average" components.

For example, in the relationship

GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

We can use a time series model to learn the relationship between the historical data and its future value.
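As a simplified illustration, the sketch below implements Holt's linear-trend exponential smoothing, which is the Holt-Winters idea without the seasonal component. The smoothing factors `alpha` and `beta` and the GDP index values are assumptions chosen for the example, not fitted values.

```python
def holt_forecast(series, alpha=0.5, beta=0.5, steps=1):
    """Forecast `steps` periods ahead with Holt's linear-trend smoothing (no seasonality)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise the level with the first value and the trend with the first difference.
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + steps * trend


# Hypothetical yearly GDP index, oldest first.
gdp_history = [100.0, 102.0, 104.5, 106.0, 108.5]
next_gdp = holt_forecast(gdp_history)
```

On a perfectly linear series the forecast simply continues the line, which makes a quick sanity check before trying real data.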

**Completely Unknown Function**

But if we cannot even assume the model family, we can consider using a "k-nearest neighbors" approach to interpolate the output from its input. We need to define the "distance function" between data points based on domain knowledge and also determine the optimal value of k. In many cases, a weighted average of the k nearest neighbors is a good interpolation.

For example, in the relationship

UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)

It is unclear what model to use to represent UnitSold as a function of Supply, Demand, Price, and CompetitorPrice, so we go with a nearest-neighbor approach.

Based on the monthly sales of the past 3 years, we can use "Euclidean distance" (we can also consider scaling the data to a comparable range by subtracting the mean and dividing by the standard deviation) to find the closest 5 neighbors, and then use their weighted average to predict the units sold.
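A minimal sketch of this step, assuming inverse-distance weights (the weighting scheme is my assumption; other weightings are possible). The monthly rows are invented for illustration, with columns in the order supply, demand, price, competitor price.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit standard deviation; also return the scaler."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant columns are left unscaled

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, stds)]

    return [scale(r) for r in rows], scale


def knn_predict(features, targets, query, k=5):
    """Predict with an inverse-distance-weighted average of the k nearest neighbors."""
    scaled, scale = standardize(features)
    q = scale(query)
    nearest = sorted((math.dist(row, q), y) for row, y in zip(scaled, targets))[:k]
    # An exact match would get infinite weight, so return its target directly.
    if nearest[0][0] == 0:
        return nearest[0][1]
    weights = [1 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Hypothetical monthly history: [supply, demand, price, competitor_price] -> units sold.
features = [
    [1000, 950, 30000, 31000],
    [1100, 900, 31000, 31000],
    [900, 1000, 29000, 30000],
    [1200, 1100, 30000, 32000],
    [1000, 800, 32000, 30000],
    [950, 1050, 28000, 29000],
]
units = [940, 880, 900, 1080, 790, 950]

estimate = knn_predict(features, units, [1050, 1000, 30500, 31000], k=5)
```

Standardizing first matters here because price is measured in tens of thousands while supply is in hundreds. Without it, price would dominate the Euclidean distance.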

### Step 3: Optimization

At this point, we have the following defined:

- A goal defined by maximizing (or minimizing) an objective function
- A set of variables (including the decision and environment variables)
- A set of functions that define how these variables are related to each other. Some of them are defined by a mathematical formula, and some are defined as a black box (based on a predictive model)

**Determine the value of environment variables**

For environment variables that have no dependencies on other variables, we can acquire their values from external data sources. For environment variables that depend on other environment variables (but not on decision variables), we can estimate their values using the corresponding dependency function (of course, we need to estimate all the variables they depend on first). Environment variables that depend (directly or indirectly) on decision variables are left undefined.
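The three cases can be sketched as a small classification over the dependency graph. This assumes the graph is stored as a mapping from each variable to the variables it depends on, and that UnitProduced_2015 and Price_2015 are the decision variables. The function and group names are illustrative.

```python
def classify_environment(dependencies, decisions, objective):
    """Split environment variables into external, estimable, and undefined groups."""
    variables = set(dependencies) | {p for parents in dependencies.values() for p in parents}

    def touches_decision(variable):
        """True if the variable is, or depends directly or indirectly on, a decision variable."""
        if variable in decisions:
            return True
        return any(touches_decision(p) for p in dependencies.get(variable, ()))

    groups = {"external": set(), "estimable": set(), "undefined": set()}
    for variable in variables - decisions - {objective}:
        if not dependencies.get(variable):
            groups["external"].add(variable)    # no dependencies: look it up externally
        elif touches_decision(variable):
            groups["undefined"].add(variable)   # depends on a decision: leave for the optimizer
        else:
            groups["estimable"].add(variable)   # depends only on environment variables
    return groups


# Links taken from the example; only the listed dependencies are included.
DEPENDENCIES = {
    "Profit_2015": {"UnitSold_2015", "UnitProduced_2015", "Price_2015", "Cost_2015"},
    "UnitSold_2015": {"Supply_2015", "Demand_2015", "Price_2015", "CompetitorPrice_2015"},
    "Demand_2015": {"GDP_2014", "PhoneSold_2014"},
}
groups = classify_environment(
    DEPENDENCIES,
    decisions={"UnitProduced_2015", "Price_2015"},
    objective="Profit_2015",
)
```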

**Determine the best value of decision variables**

Once we have formulated the dependency functions, we can employ different optimization methods depending on the form of these functions. Here is how I choose the appropriate method based on the formulation of the dependency functions.
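As one illustration only, and not a reconstruction of the selection criteria mentioned above, an exhaustive grid search works with any black-box objective when there are few decision variables and their candidate values can be enumerated. The toy demand curve, the unit cost of 15,000, and the candidate ranges below are all hypothetical stand-ins for the fitted dependency functions.

```python
from itertools import product


def grid_search(objective, candidates):
    """Evaluate the objective on every combination of candidate values; return the best one."""
    names = list(candidates)
    best_choice, best_value = None, float("-inf")
    for values in product(*(candidates[n] for n in names)):
        choice = dict(zip(names, values))
        value = objective(**choice)
        if value > best_value:
            best_choice, best_value = choice, value
    return best_choice, best_value


def toy_profit(unit_produced, price):
    """Hypothetical objective standing in for Profit = F(...)."""
    demand = max(0, 2_000 - price // 20)   # demand falls as the price rises
    unit_sold = min(unit_produced, demand)  # cannot sell more than was built
    return unit_sold * price - unit_produced * 15_000


best, value = grid_search(
    toy_profit,
    {"unit_produced": range(0, 2_001, 250), "price": range(16_000, 40_001, 2_000)},
)
print(best, value)  # {'unit_produced': 500, 'price': 30000} 7500000
```

Grid search scales poorly as the number of decision variables grows, so it is best seen as a baseline. When the dependency functions have a known mathematical form, more specialized optimization methods are usually a better fit.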

### Additional Challenges

To summarize, we have followed the process below:

- Define an objective function, constraints, decision variables and environment variables
- Identify the relationship between different variables
- Collect or predict those environment variables
- Optimize those decision variables based on the objective functions
- Return the optimal value of decision variables as the answer

Published at DZone with permission of Ricky Ho , DZone MVB. See the original article here.
