My view of Data Analysis is that there is a continuum between explanatory models on one side and predictive models on the other. The decisions you make during the modeling process depend on your goal. Take customer churn as an example. You can ask yourself, "Why are customers leaving?" or you can ask yourself, "Which customers are leaving?" The primary goal of the first question is to explain churn, while the primary goal of the second is to predict churn. These are two fundamentally different questions, and the difference has implications for the decisions you make along the way. The predictive side of Data Analysis is closely related to terms like Data Mining and Machine Learning.
SPSS and SAS
When we look at SPSS and SAS, both of these languages originate from the explanatory side of Data Analysis. They were developed in an academic environment where hypothesis testing plays a major role. As a result, they offer significantly fewer methods and techniques than R and Python. Nowadays, SAS and SPSS both have data mining tools (SAS Enterprise Miner and SPSS Modeler); however, these are separate tools that require extra licenses.
I have spent some time building extensive macros in SAS EG to seamlessly create predictive models, which also do a decent job of explaining feature importance. While a Neural Network may do a fair job at making predictions, it is extremely difficult to explain such models, let alone their feature importance. The macros I have built in SAS EG do precisely the job of explaining the features while also producing excellent predictions.
Open-Source Tools: R and Python
One of the major advantages of open-source tools is that the community continuously improves and increases functionality. R was created by academics who wanted their algorithms to spread as easily as possible. R has the widest range of algorithms, which makes R strong on the explanatory side and on the predictive side of Data Analysis.
Python was developed with a strong focus on (business) applications, not from an academic or statistical standpoint. This makes Python very powerful when algorithms are used directly in applications. Hence, its statistical capabilities are primarily focused on the predictive side. Python is mostly used in Data Mining or Machine Learning applications where a data analyst doesn't need to intervene. Python is therefore also strong in analyzing images and videos, and it is the easiest language to use with Big Data frameworks like Spark. With its plethora of packages and ever-improving functionality, Python is a very accessible tool for data scientists.
Machine Learning Models
While procedures like Logistic Regression are very good at explaining the features used in a prediction, others like Neural Networks are not. The latter may be preferred over the former when only prediction accuracy matters and explaining the model does not. Interpreting or explaining the model becomes an issue for Neural Networks: you can't just peek inside a deep Neural Network to figure out how it works. A network's reasoning is embedded in the behavior of numerous simulated neurons, arranged into dozens or even hundreds of interconnected layers.
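A minimal sketch of what "explaining the features" means in practice for Logistic Regression, using scikit-learn on synthetic data (the churn-style feature names here are hypothetical, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a churn dataset with four features.
X, y = make_classification(n_samples=500, n_features=4,
                           n_informative=2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One coefficient per feature: its sign and magnitude show how that
# feature pushes the predicted churn probability up or down.
for name, coef in zip(["tenure", "spend", "support_calls", "age"],
                      model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

A Neural Network offers no such per-feature readout; its learned weights have no comparably direct interpretation.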
In most cases, a Product Marketing Officer is more interested in knowing which factors matter most for a specific advertising project (what they can concentrate on to raise response rates) than in what their response rate or revenues will be in the upcoming year. These questions are better answered by procedures that can be interpreted more easily. This is a great article about the technical and ethical consequences of the lack of explanations provided by complex AI models.
Procedures like Decision Trees are very good at explaining and visualizing decision points (features and their split values). However, they do not produce the best models. Random Forests and Boosting use Decision Trees as the basic building block for predictive models, and they are by far among the best methods for building sophisticated prediction models.
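To illustrate the readability of decision points, here is a short sketch that prints a shallow tree as plain rules with scikit-learn's export_text; the data and feature names are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the feature names are hypothetical.
X, y = make_classification(n_samples=500, n_features=4,
                           n_informative=2, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every split threshold and resulting class is readable as a rule.
rules = export_text(tree, feature_names=["tenure", "spend",
                                         "support_calls", "age"])
print(rules)
```

This rule listing is exactly the kind of artifact that can go straight into a slide for business stakeholders.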
Random Forests grow full-size (highly complex) Trees, each fit on a random sample drawn with replacement from the training set (a process called Bootstrapping). In addition, each split considers only a proper subset of the features, chosen at random, rather than the entire feature set. Bootstrapping helps when the amount of training data is limited (in many cases, there is no way to get more data).
Subsetting the features has a tremendous effect on de-correlating the Trees grown in the Forest (hence the "random"), leading to a drop in Test Set error. A fresh subset of features is chosen at each split, making the method robust. This strategy also stops the strongest feature from appearing every time a split is considered, which would otherwise make all the trees in the forest similar. The final result is obtained by averaging over all trees (for Regression problems) or by taking a majority class vote (for classification problems).
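The two sources of randomness described above map directly onto two scikit-learn parameters; this sketch on synthetic data shows both (the dataset sizes and settings are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,        # each tree is fit on a bootstrap sample of rows
    max_features="sqrt",   # each split considers only sqrt(p) random features
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", forest.score(X_te, y_te))
```

Setting max_features to the full feature count would let the strongest feature dominate every tree, undoing the de-correlation.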
On the other hand, Boosting grows a Forest using Trees that are not fully grown, in other words, Weak Learners. One specifies the number of trees to be grown, and the trees are fit sequentially: each new tree concentrates on the mistakes of the ensemble built so far. In AdaBoost, the weights of the training examples that were misclassified are increased, so the next tree concentrates on those failed examples; in Gradient Boosting, each new tree is fit to the residuals of the current ensemble. Either way, the method proceeds by improving the accuracy of the Boosted Trees, stopping when the improvement falls below a threshold or the requested number of trees has been grown.
One particular implementation of Boosting, AdaBoost, often achieves very good accuracy compared to other implementations. AdaBoost typically uses Trees of depth 1, known as Decision Stumps, as the members of the Forest. These start out only slightly better than random guessing, but over the iterations they learn the pattern and the ensemble performs extremely well on test sets. The method resembles a feedback control mechanism in which the system learns from its errors. To address overfitting, one can tune the Learning Rate hyper-parameter (lambda), choosing values in the range (0, 1]. Very small values of lambda take longer to converge, while large values may overshoot and overfit. Selecting lambda can be done iteratively: plot the test error rate against candidate values of lambda and choose the value with the lowest test error.
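A sketch of that lambda selection loop with scikit-learn's AdaBoostClassifier, whose default base learner is already a depth-1 stump; the candidate grid and data are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit one boosted ensemble of stumps per candidate learning rate
# and record its test error.
errors = {}
for lam in (0.01, 0.1, 0.5, 1.0):
    clf = AdaBoostClassifier(n_estimators=200, learning_rate=lam,
                             random_state=0).fit(X_tr, y_tr)
    errors[lam] = 1 - clf.score(X_te, y_te)

best_lam = min(errors, key=errors.get)  # lambda with lowest test error
print("chosen lambda:", best_lam)
```

In practice one would use a held-out validation set or cross-validation rather than the test set for this selection, but the shape of the loop is the same.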
In all these methods, as we move from Logistic Regression to Decision Trees to Random Forests and Boosting, the complexity of the models increases, making it almost impossible to EXPLAIN a Boosting model to marketers or product managers. Decision Trees are easy to visualize. Logistic Regression results can be used to demonstrate the most important factors in a customer acquisition model and hence are well received by business leaders. Random Forests and Boosting, on the other hand, are extremely good predictors with little scope for explanation. But there is hope: these models provide functions for revealing the most important variables, even though it is not possible to visualize how those variables drive an individual prediction.
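A minimal sketch of that variable-importance readout for a boosted model, using scikit-learn's built-in feature_importances_ attribute on synthetic data with hypothetical churn-style feature names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importances sum to 1; they rank variables but do not show
# how any single prediction was made.
names = ["tenure", "spend", "support_calls", "age", "contract_len"]
for name, imp in sorted(zip(names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

This ranking is what can be shown to business leaders even when the model itself stays a black box.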
Using a Balanced Approach
I use a mixed strategy. I use the interpretable methods as a step in Exploratory Data Analysis, present the importance of features and the characteristics of the data to business leaders in phase one, and then, after building competing models, use the more complex models to build the prediction models for deployment. That way, one not only gets to understand what is happening and why, but also gets the best predictive power. I have rarely seen a mismatch between the explanation and the predictions across different methods. After all, this is all math, and the method of delivery should not change the end results.