
Explanatory vs. Predictive Models in Machine Learning

Using a mixed approach and balancing a variety of Machine Learning models can help you understand what is happening and why and can provide strong predictive power.

by Agnijit Das Gupta · Apr. 26, 17 · Opinion

My view of Data Analysis is that there is a continuum between explanatory models at one end and predictive models at the other. The decisions that you make during the modeling process depend on your goal. Let’s take customer churn as an example. You can ask yourself, “Why are customers leaving?” or you can ask yourself, “Which customers are leaving?” The primary goal of the first question is to explain churn, while the primary goal of the second is to predict it. These are two fundamentally different questions, and the difference has implications for the decisions you make along the way. The predictive side of Data Analysis is closely related to terms like Data Mining and Machine Learning.

SPSS and SAS

SPSS and SAS both originate from the explanatory side of Data Analysis. They were developed in an academic environment where hypothesis testing plays a major role. As a result, they offer significantly fewer methods and techniques than R and Python. Nowadays, SAS and SPSS both have data mining tools (SAS Enterprise Miner and SPSS Modeler); however, these are separate products that require extra licenses.

I have spent some time building extensive macros in SAS EG (Enterprise Guide) that create predictive models seamlessly and also do a decent job of explaining feature importance. While a Neural Network may do a fair job of making predictions, such a model is extremely difficult to explain, let alone its feature importance. The macros that I built in SAS EG do precisely that job of explaining the features, in addition to producing excellent predictions.

Open-Source Tools: R and Python

One of the major advantages of open-source tools is that the community continuously improves and increases functionality. R was created by academics who wanted their algorithms to spread as easily as possible. R has the widest range of algorithms, which makes R strong on the explanatory side and on the predictive side of Data Analysis.

Python was developed with a strong focus on (business) applications rather than from an academic or statistical standpoint. This makes Python very powerful when algorithms are used directly in applications. Hence, its statistical capabilities are primarily focused on the predictive side. Python is mostly used in Data Mining or Machine Learning applications where a data analyst doesn’t need to intervene, which also makes it strong at analyzing images and videos. Python is also among the easiest languages to use with Big Data frameworks like Spark. With its plethora of packages and ever-improving functionality, Python is a very accessible tool for data scientists.

Machine Learning Models

While procedures like Logistic Regression are very good at explaining the features used in a prediction, others, like Neural Networks, are not. The latter may be preferred when only prediction accuracy matters and the model does not need to be explained. Interpretation becomes an issue for Neural Networks: you can’t just peek inside a deep Neural Network to figure out how it works. A network’s reasoning is embedded in the behavior of numerous simulated neurons, arranged into dozens or even hundreds of interconnected layers.

In many cases, a Product Marketing Officer is more interested in knowing which factors matter most for a specific advertising project (what they can concentrate on to raise response rates) than in what the response rate or revenue will be in the upcoming year. Such questions are better answered by procedures that are easier to interpret. This is a great article about the technical and ethical consequences of the lack of explanations provided by complex AI models.
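To make the interpretability point concrete, here is a minimal sketch of how Logistic Regression coefficients can be read as odds ratios. The coefficient values and feature names are hypothetical, chosen only for illustration; they do not come from a fitted model or a real campaign.

```python
import math

# Hypothetical fitted coefficients (log-odds) from a response model.
# The feature names and values are illustrative, not from a real campaign.
coefficients = {
    "discount_offered": 0.70,
    "emails_per_week": -0.20,
    "is_existing_customer": 1.10,
}

def odds_ratios(coefs):
    """Convert log-odds coefficients into odds ratios (exp of each coefficient)."""
    return {name: math.exp(beta) for name, beta in coefs.items()}

def rank_by_effect(coefs):
    """Order features by the absolute size of their coefficient, largest first."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)

ratios = odds_ratios(coefficients)
for name in rank_by_effect(coefficients):
    direction = "raises" if coefficients[name] > 0 else "lowers"
    print(f"{name}: odds ratio {ratios[name]:.2f} ({direction} the odds of a response)")
```

An odds ratio above 1 means the feature raises the odds of a response, and below 1 means it lowers them. Ranking by coefficient size is only meaningful when the features are on comparable scales, so standardizing them first is usually a sensible assumption to add.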

Procedures like Decision Trees are very good at explaining and visualizing decision points (features and their metrics). However, a single tree rarely produces the best model. Random Forests and Boosting use Decision Trees as their basic building block and are among the best methods for building sophisticated prediction models.
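As a rough illustration of why a decision point is easy to explain, the sketch below finds the best single split on one feature using Gini impurity and prints it as a readable rule. The data and the feature name `tenure_months` are made up for the example, and a real tree would repeat this search recursively over many features.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best single split on one feature."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy churn data (illustrative): tenure in months and whether the customer left.
tenure_months = [1, 2, 3, 10, 12, 15]
churned = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(tenure_months, churned)
print(f"if tenure_months <= {threshold}: predict churn, else: predict stay")
```

On this toy data the split separates the two groups perfectly, which is what makes the printed rule something a business audience can read directly.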

Random Forests use full-grown (highly complex) Trees. Each tree is trained on a random sample drawn with replacement from the training set (a process called Bootstrapping), and each split considers only a random proper subset of the features rather than the entire feature set. Bootstrapping helps when training data is limited (in many cases, getting more data is not an option).

The (proper) subsetting of the features has a tremendous effect on de-correlating the Trees grown in the Forest (hence randomizing it), leading to a drop in Test Set errors. A fresh subset of features is chosen at each split, making the method robust. The strategy also stops the strongest feature from appearing every time a split is considered, which would otherwise make all the trees in the forest similar. The final result is obtained by averaging the results over all trees (for Regression problems) or by taking a majority class vote (for Classification problems).
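The three ingredients described here (a bootstrap sample per tree, a random feature subset per split, and a vote or average at the end) can be sketched with the standard library alone. This is not a full Random Forest: tree growing is left out, and the feature names, the seed, and the choice of two features per split are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample per tree)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, m, rng):
    """Pick m distinct candidate features for a single split."""
    return rng.sample(features, m)

def majority_vote(predictions):
    """Combine per-tree class predictions for a classification problem."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Combine per-tree numeric predictions for a regression problem."""
    return sum(predictions) / len(predictions)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-ins for ten training rows
features = ["tenure", "plan", "support_calls", "monthly_spend"]  # hypothetical names

sample = bootstrap_sample(rows, rng)
candidates = random_feature_subset(features, 2, rng)  # m=2 is an arbitrary choice here
print("bootstrap sample:", sample)
print("features considered at this split:", candidates)
print("forest vote:", majority_vote(["churn", "stay", "churn"]))
print("forest average:", average([0.2, 0.4, 0.6]))
```

Because sampling is with replacement, some rows appear more than once in a tree's sample and others not at all, which is part of what makes the trees differ from one another.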

On the other hand, Boosting is a method where a Forest is grown sequentially using Trees that are not fully grown, in other words, Weak Learners. One has to specify the number of trees to be grown. In AdaBoost, every training example starts with the same weight, 1/N for N examples. At each iteration, the method fits a weak learner and measures its errors. The weights of the examples that the learner misclassified are then increased so that the next tree concentrates on the failed examples, and each tree’s vote in the final majority vote is weighted by its accuracy. (Gradient Boosting variants instead fit each new tree to the residuals of the current ensemble.) This way, the method proceeds by improving the accuracy of the Boosted Trees, stopping when the requested number of trees is reached or the improvement falls below a threshold.
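The reweighting idea can be shown with one round of the standard discrete AdaBoost update, in which the weights are attached to the training examples and labels are coded as -1 and +1. This is a sketch of that textbook formulation under those assumptions, not of any particular library's implementation; the labels and predictions are invented for the example.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step of discrete AdaBoost with labels in {-1, +1}.

    Returns (alpha, new_weights): the vote weight of this weak learner and the
    normalized example weights for the next round.
    """
    # Weighted error: total weight of the examples this learner got wrong.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < err < 1:
        raise ValueError("this sketch assumes a weighted error strictly between 0 and 1")
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified examples (t * p == -1) are scaled up, correct ones down.
    scaled = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]              # the weak learner misses the last example
weights = [1 / len(y_true)] * len(y_true)  # every example starts with weight 1/N

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(f"alpha = {alpha:.3f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update, the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.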

One particular implementation of Boosting, AdaBoost, is known for very good accuracy. AdaBoost typically uses Trees of depth 1, known as Decision Stumps, as the members of the Forest. Each stump is only slightly better than random guessing, but combined over many iterations they learn the pattern and can perform extremely well on test sets. The method works like a feedback control mechanism in which the system learns from its errors. To address overfitting, one can tune the Learning Rate hyper-parameter (lambda), choosing values in the range (0, 1]. Very small values of lambda need more trees and take more time to converge, while larger values may have difficulty converging. The value can be selected iteratively by plotting the error rate on held-out data against candidate values of lambda and choosing the value with the lowest error.
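The selection loop for lambda might look like the following sketch. To keep it short and dependency-free, it swaps in a small gradient-boosting regressor built from depth-1 stumps rather than AdaBoost itself; the synthetic sine data, the candidate values, and the choice of 20 trees are all assumptions made for illustration.

```python
import math

def fit_stump(xs, targets):
    """Best depth-1 regression tree: (threshold, left_value, right_value)."""
    distinct = sorted(set(xs))
    best, best_sse = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if sse < best_sse:
            best, best_sse = (thr, lv, rv), sse
    return best

def stump_predict(stump, x):
    thr, lv, rv = stump
    return lv if x <= thr else rv

def boost(xs, ys, n_trees, lam):
    """Fit n_trees stumps to successive residuals, shrinking each by lam."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lam * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps

def predict(model, lam, x):
    base, stumps = model
    return base + lam * sum(stump_predict(s, x) for s in stumps)

def mse(model, lam, xs, ys):
    return sum((predict(model, lam, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (illustrative only): even points train, odd points validate.
train_x = [i / 10 for i in range(0, 40, 2)]
valid_x = [i / 10 for i in range(1, 40, 2)]
train_y = [math.sin(x) for x in train_x]
valid_y = [math.sin(x) for x in valid_x]

candidates = [0.01, 0.1, 0.5, 1.0]
errors = {lam: mse(boost(train_x, train_y, 20, lam), lam, valid_x, valid_y)
          for lam in candidates}
best_lam = min(errors, key=errors.get)
for lam in candidates:
    print(f"lambda={lam}: validation MSE={errors[lam]:.4f}")
print("chosen lambda:", best_lam)
```

Scoring each candidate on held-out validation data, rather than on the final test set, keeps the test set available for an unbiased estimate once lambda is fixed.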

As we move from Logistic Regression to Decision Trees to Random Forests and Boosting, the complexity of the models increases, making it almost impossible to explain a Boosting model to marketers and product managers. Decision Trees are easy to visualize. Logistic Regression results can be used to demonstrate the most important factors in a customer acquisition model and hence will be well received by business leaders. Random Forest and Boosting methods, on the other hand, are extremely good predictors without much scope for explanation. But there is hope: these models have functions for revealing the most important variables, although they do not show why those variables matter.
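One model-agnostic way to reveal the most important variables is permutation importance: shuffle one feature column and measure how much the model's accuracy drops. The sketch below assumes a hypothetical black-box model and made-up rows; libraries such as scikit-learn also expose a `feature_importances_` attribute on their tree ensembles, which is computed differently.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, rng):
    """Drop in accuracy when one feature column is shuffled, per feature."""
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled version.
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, shuffled, labels))
    return importances

# Hypothetical black-box model: it only ever looks at feature 0 (support calls).
def black_box(row):
    return 1 if row[0] >= 3 else 0

# Illustrative rows: (support_calls, account_age_years).
rows = [(0, 5), (1, 2), (2, 7), (3, 1), (4, 4), (5, 3), (6, 6), (7, 8)]
labels = [black_box(r) for r in rows]

scores = permutation_importance(black_box, rows, labels, 2, random.Random(0))
print("importance of support_calls:", scores[0])
print("importance of account_age_years:", scores[1])
```

A feature the model never uses scores exactly zero, which matches the point above: the ranking says which variables matter, but not why.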

Using a Balanced Approach

I use a mixed strategy. In phase one, I use the simpler, interpretable methods as a step in Exploratory Data Analysis and present the importance of features and the characteristics of the data to the business leaders. I then build competing models with the more complicated methods and pick the prediction model for deployment. That way, one not only gets to understand what is happening and why but also gets the best predictive power. I have rarely seen a mismatch between the explanation and the predictions from the different methods. After all, this is all math, and the way of delivery should not change the end results.

Published at DZone with permission of Agnijit Das Gupta. See the original article here.

Opinions expressed by DZone contributors are their own.
