When to Use Linear Regression, Clustering, or Decision Trees
Many articles define decision trees, clustering, and linear regression, as well as the differences between them — but they often neglect to discuss where to use them.
The differences between decision trees, clustering, and linear regression algorithms have been illustrated in many articles. However, it's not always clear where these algorithms can be used. In this blog post, I explain where you can use these machine learning algorithms and what factors you should consider when selecting one for your needs.
Linear Regression Use Cases
Some uses of linear regression are:
- Forecasting sales of a product from pricing, performance, and risk parameters
- Generating insights on consumer behavior, profitability, and other business factors
- Evaluating trends and making estimates and forecasts
- Determining the effect of marketing, pricing, and promotions on sales of a product
- Assessing risk in the financial services and insurance domains
- Studying engine performance from test data in automobiles
- Estimating relationships between parameters in biological systems
- Conducting market research studies and analyzing customer survey results
- Analyzing astronomical data
- Predicting house prices from house size
Some other use cases where linear regression is often put to use are stock trading, video games, sports betting, and flight time prediction.
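To make the house-price use case concrete, here is a minimal least-squares sketch in plain Python; the sizes and prices below are invented purely for illustration:

```python
# Ordinary least squares for a single feature: price = slope * size + intercept.
# The data points are invented purely for illustration.

sizes = [50, 70, 80, 100, 120]        # square metres
prices = [150, 200, 230, 280, 330]    # thousands

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# slope = covariance(size, price) / variance(size)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def predict(size):
    return slope * size + intercept

print(predict(90))  # estimated price for a 90 m^2 house (~253)
```

The same closed-form fit generalizes to several features, which is what most of the use cases above would need in practice.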
Decision Tree Use Cases
Some uses of decision trees are:
- Building knowledge management platforms for customer service that improve first call resolution, average handling time, and customer satisfaction rates
- In finance, forecasting future outcomes and assigning probabilities to those outcomes
- Binomial option pricing predictions and real option analysis
- Modeling a customer’s willingness to purchase a given product in a given setting, both offline and online
- Product planning; for example, Gerber Products, Inc. used decision trees to decide whether to continue using PVC in manufacturing toys
- General business decision-making
- Loan approval
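As a sketch of why trees suit decisions like loan approval, here is a toy, hand-written tree. The features (`credit_score`, `income`, `existing_debt`) and every threshold are invented for illustration; a real tree would be learned from historical loan data rather than written by hand:

```python
# A toy, hand-written decision tree for loan approval.
# Each branch is a readable if/else rule, which is exactly what makes
# decision trees transparent. All thresholds here are invented.

def approve_loan(credit_score, income, existing_debt):
    if credit_score < 600:
        return "reject"
    if income < 30_000:
        return "reject" if existing_debt > 5_000 else "review"
    # good credit and sufficient income
    return "approve" if existing_debt < income * 0.4 else "review"

print(approve_loan(credit_score=720, income=55_000, existing_debt=10_000))
# -> "approve" (10,000 is below 0.4 * 55,000)
```

Because every path through the tree is an explicit rule, an applicant can be told exactly why a decision was made.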
Clustering Use Cases
Some uses of clustering algorithms are:
- Customer segmentation
- Classification of species by using their physical dimensions
- Product categorization
- Movie recommendations
- Identifying locations for cellular towers in a particular region
- Planning effective police enforcement
- Placing emergency wards near the most accident-prone areas in a region
- Clustering genes
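As a sketch of the customer-segmentation use case, here is a minimal one-dimensional k-means in plain Python; the annual spend figures and initial centroids are invented for illustration, and a real segmentation would use several features:

```python
# A minimal 1-D k-means sketch for customer segmentation.
# All numbers below are invented for illustration.

def kmeans_1d(values, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

annual_spend = [120, 150, 130, 900, 950, 1000, 400, 420]
centroids, clusters = kmeans_1d(annual_spend, centroids=[100, 500, 1000])
print(centroids)  # three spend segments: low, medium, high
```

The algorithm is told nothing about which customer belongs where; the segments emerge from the data alone, which is what makes this unsupervised.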
How to Select the Right Machine Learning Algorithm
Now that you understand use cases and where these machine learning algorithms can prove useful, let’s talk about how to select the perfect algorithm for your needs.
Linear Regression Selection Criteria
Let's talk about classification and regression capabilities, error rates, data compatibilities, data quality, computational complexity, and comprehensibility and transparency.
Classification and Regression Capabilities
Regression models predict a continuous variable, such as the sales made on a given day or the temperature of a city.
Their reliance on fitting a single polynomial (such as a straight line) to a dataset poses a real challenge when it comes to building a classification capability.
Let's imagine that you fit a line to the training points you have. Now imagine adding another data point; to fit it, you may need to change your existing model (and maybe the decision threshold as well). This happens with each data point added to the model, so linear regression isn't a good fit for classification.
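This instability can be seen in a tiny made-up example: fit a line to 0/1 class labels, classify by thresholding the line at 0.5, and watch a single extra (correctly labelled) extreme point move the decision threshold:

```python
# Fit a line to binary labels and "classify" by thresholding at 0.5.
# Adding one extreme but correctly labelled point shifts the whole line,
# which moves the decision threshold. All data is invented.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def threshold(slope, intercept):
    # x at which the fitted line crosses 0.5
    return (0.5 - intercept) / slope

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]          # class labels
t1 = threshold(*fit_line(xs, ys))

xs.append(50); ys.append(1)      # one extreme positive example
t2 = threshold(*fit_line(xs, ys))

print(t1, t2)  # the threshold moves even though the new point was "easy"
```

Here the original threshold sits at 3.5, exactly between the classes; after the extra point it drifts past 4, so a previously correct training point would now be misclassified.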
Error Rates
Linear regression is weaker than the other two algorithms when it comes to reducing error rates.
Data Compatibilities
Linear regression relies on continuous data to build regression capabilities.
Data Quality
Each missing value removes one data point that could optimize the regression. In simple linear regression, outliers can significantly disrupt the outcomes.
Computational Complexity
Linear regression is often not computationally expensive compared to decision trees and clustering algorithms. For N training examples and X features, the order of complexity usually falls around O(X²), O(XN), or O(X³), depending on the solution method.
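One way to see where such orders come from is the closed-form normal-equation fit, sketched below with NumPy: forming XᵀX costs on the order of N·X² operations, and solving the resulting X-by-X linear system costs on the order of X³ (the data and variable names here are illustrative):

```python
import numpy as np

# Normal-equation fit: beta solves (X^T X) beta = X^T y.
# Forming X^T X is O(N * F^2) for N samples and F features;
# solving the F-by-F system is O(F^3).

rng = np.random.default_rng(0)
N, F = 200, 3
X = rng.normal(size=(N, F))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta            # noiseless, so the fit should recover true_beta

XtX = X.T @ X                          # O(N * F^2)
beta = np.linalg.solve(XtX, X.T @ y)   # O(F^3)
print(beta)
```

Because N is usually much larger than the number of features, the O(XN)-style term tends to dominate, which is why linear regression stays cheap even on fairly large datasets.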
Comprehensible and Transparent
Linear models are easily comprehensible and transparent in nature. They can be represented in simple mathematical notation and explained to almost anyone.
Decision Trees Selection Criteria
Decision trees are a method for classifying subjects into known groups. They're a form of supervised learning.
Decision trees can further be classified as “eager learners”: they first build a classification model on the training dataset and then actually classify the test dataset. This eagerness to learn a model up front and classify unseen observations is why they are called “eager learners.”
Classification and Regression Capabilities
Decision trees are compatible with both types of tasks — regression as well as classification.
Computational Efficiency
Since a trained decision tree is an in-memory classification model, classification does not incur high computational costs: there is no need for frequent database lookups.
Arbitrary Complicated Decision Boundaries
Decision trees cannot easily model arbitrarily complicated decision boundaries. Each split is axis-aligned, so smooth or diagonal boundaries can only be approximated by many small rectangular steps.
Comprehensible and Transparent
They are extensively used by banks for loan approvals precisely because of the transparency of their rule-based decision-making.
Data Quality
Decision trees can handle datasets with a high degree of errors and missing values.
Incremental Learning
Decision trees work in batches, modeling one group of training observations at a time. Hence, they are unfit for incremental learning.
Error Rates
They have relatively high error rates, though not as high as linear regression.
Data Compatibilities
Decision trees can handle data with both numeric and nominal input attributes.
Assumptions
Decision trees are well-known for making no assumptions about spatial distribution or the classifier’s structure.
Impact of Number of Attributes
These algorithms tend to produce poor results when complex, humanly intangible factors are present. In cases like customer segmentation, for example, it would be very hard for a decision tree to return accurate segments.
Clustering Algorithms Selection Criteria
Clustering algorithms are generally used to find out how subjects are similar on a number of different variables. They're a form of unsupervised learning.
Clustering algorithms, however, aren’t eager learners; they learn directly from the training instances and start processing data only after they are given a test observation to classify.
Classification and Regression Capabilities
Clustering algorithms cannot be used for regression tasks.
Data Handling Capabilities
Clustering can handle most types of datasets and ignore missing values.
Dataset Quality
They work well with both continuous and categorical (factor) data values.
Comprehensible and Transparent
Unlike decision trees, clustering algorithms often don’t bring in the same level of comprehension and transparency. Often, they require a lot of implementation-level explanations for decision-makers.
Computational Efficiency
Clustering algorithms often require frequent database lookups. Hence, they can often be computationally expensive.
Arbitrary Complicated Decision Boundaries
Because of instance-based learning, a fine-tuned clustering algorithm can easily incorporate arbitrarily complex decision boundaries.
Incremental Learning
Clustering naturally supports incremental learning and is the preferred choice over both linear regression and decision trees when new observations arrive continuously.
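A minimal sketch of what incremental learning looks like here: an online centroid update (a standard running-mean update, not tied to any particular library) absorbs one observation at a time without retraining on the whole dataset. All numbers are invented for illustration:

```python
# Online (incremental) centroid update: each new point nudges its nearest
# centroid toward it, weighted by how many points that centroid has seen.
# No pass over the full dataset is ever needed.

centroids = [10.0, 100.0]
counts = [1, 1]  # points seen per centroid so far (one initial point each)

def absorb(x):
    i = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
    counts[i] += 1
    # running-mean update: c += (x - c) / n
    centroids[i] += (x - centroids[i]) / counts[i]

for x in [12, 95, 14, 110, 9]:
    absorb(x)

print(centroids)  # each centroid has drifted toward its incoming points
```

Contrast this with a decision tree, which would have to be rebuilt from scratch (or heavily restructured) to account for the five new observations.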
Error Rates
Clustering test error rates are close to those of Bayesian classifiers.
Impact of Number of Attributes
Thanks to their ability to handle complex, arbitrary boundaries, and unlike decision trees, clustering algorithms can handle multiple attributes and complex interactions between them.
I hope this helps you get started with these algorithms!