When to Use Linear Regression, Clustering, or Decision Trees
Many articles define decision trees, clustering, and linear regression, as well as the differences between them — but they often neglect to discuss where to use them.
The differences between decision trees, clustering, and linear regression algorithms have been illustrated in many articles. However, it's not always clear where these algorithms can be used. In this blog post, I explain where you can use these machine learning algorithms and what factors you should consider when selecting one for your needs.
Linear Regression Use Cases
Some uses of linear regression are:
- Forecasting sales of a product from pricing, performance, and risk parameters
- Generating insights on consumer behavior, profitability, and other business factors
- Evaluating trends and making estimates and forecasts
- Determining the effect of marketing, pricing, and promotions on sales of a product
- Assessing risk in the financial services and insurance domains
- Studying engine performance from test data in automobiles
- Estimating relationships between parameters in biological systems
- Conducting market research studies and analyzing customer survey results
- Analyzing astronomical data
- Predicting house prices from house size
Some other use cases where linear regression is often put to use are stock trading, video games, sports betting, and flight time prediction.
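To make the house-price use case concrete, here is a minimal least-squares sketch in plain Python; the sizes and prices below are invented purely for illustration:

```python
# Ordinary least squares for a single feature: price = slope * size + intercept.
# The data points are invented purely for illustration.

sizes = [50, 70, 80, 100, 120]        # square metres
prices = [150, 200, 230, 280, 330]    # thousands

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# slope = covariance(size, price) / variance(size)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def predict(size):
    return slope * size + intercept

print(predict(90))  # estimated price for a 90 m^2 house (~253)
```

The same closed-form fit generalizes to several features, which is what most of the use cases above would need in practice.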
Decision Tree Use Cases
Some uses of decision trees are:
- Building knowledge management platforms for customer service that improve first call resolution, average handling time, and customer satisfaction rates
- In finance, forecasting future outcomes and assigning probabilities to those outcomes
- Binomial option pricing predictions and real option analysis
- Modeling a customer’s willingness to purchase a given product in a given setting, both offline and online
- Product planning; for example, Gerber Products, Inc. used decision trees to decide whether to continue using PVC in manufacturing toys
- General business decision-making
- Loan approval
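As a sketch of why trees suit decisions like loan approval, here is a toy, hand-written tree. The features (`credit_score`, `income`, `existing_debt`) and every threshold are invented for illustration; a real tree would be learned from historical loan data rather than written by hand:

```python
# A toy, hand-written decision tree for loan approval.
# Each branch is a readable if/else rule, which is exactly what makes
# decision trees transparent. All thresholds here are invented.

def approve_loan(credit_score, income, existing_debt):
    if credit_score < 600:
        return "reject"
    if income < 30_000:
        return "reject" if existing_debt > 5_000 else "review"
    # good credit and sufficient income
    return "approve" if existing_debt < income * 0.4 else "review"

print(approve_loan(credit_score=720, income=55_000, existing_debt=10_000))
# -> "approve" (10,000 is below 0.4 * 55,000)
```

Because every path through the tree is an explicit rule, an applicant can be told exactly why a decision was made.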
Clustering Use Cases
Some uses of clustering algorithms are:
- Customer segmentation
- Classification of species by using their physical dimensions
- Product categorization
- Movie recommendations
- Identifying locations for cellular towers in a particular region
- Planning effective police enforcement
- Placing emergency wards near the most accident-prone areas in a region
- Clustering genes
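As a sketch of the customer-segmentation use case, here is a minimal one-dimensional k-means in plain Python; the annual spend figures and initial centroids are invented for illustration, and a real segmentation would use several features:

```python
# A minimal 1-D k-means sketch for customer segmentation.
# All numbers below are invented for illustration.

def kmeans_1d(values, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

annual_spend = [120, 150, 130, 900, 950, 1000, 400, 420]
centroids, clusters = kmeans_1d(annual_spend, centroids=[100, 500, 1000])
print(centroids)  # three spend segments: low, medium, high
```

The algorithm is told nothing about which customer belongs where; the segments emerge from the data alone, which is what makes this unsupervised.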
How to Select the Right Machine Learning Algorithm
Now that you understand use cases and where these machine learning algorithms can prove useful, let’s talk about how to select the perfect algorithm for your needs.
Linear Regression Selection Criteria
Let's talk about classification and regression capabilities, error rates, data compatibilities, data quality, computational complexity, and comprehensibility and transparency.
Classification and Regression Capabilities
Regression models predict a continuous variable, such as the sales made on a given day or the temperature of a city.
Their reliance on fitting a single polynomial (such as a straight line) to a dataset poses a real challenge when it comes to building a classification capability.
Let's imagine that you fit a line to the training points you have. Now imagine adding another data point; to fit it, you may need to change your existing model (and maybe the decision threshold as well). This happens with each data point added to the model, so linear regression isn't a good fit for classification.
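This instability can be seen in a tiny made-up example: fit a line to 0/1 class labels, classify by thresholding the line at 0.5, and watch a single extra (correctly labelled) extreme point move the decision threshold:

```python
# Fit a line to binary labels and "classify" by thresholding at 0.5.
# Adding one extreme but correctly labelled point shifts the whole line,
# which moves the decision threshold. All data is invented.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def threshold(slope, intercept):
    # x at which the fitted line crosses 0.5
    return (0.5 - intercept) / slope

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]          # class labels
t1 = threshold(*fit_line(xs, ys))

xs.append(50); ys.append(1)      # one extreme positive example
t2 = threshold(*fit_line(xs, ys))

print(t1, t2)  # the threshold moves even though the new point was "easy"
```

Here the original threshold sits at 3.5, exactly between the classes; after the extra point it drifts past 4, so a previously correct training point would now be misclassified.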
Error Rates
Linear regression is weaker than the other two algorithms when it comes to reducing error rates.
Data Compatibilities
Linear regression relies on continuous data to build regression capabilities.
Data Quality
Each missing value removes one data point that could optimize the regression. In simple linear regression, outliers can significantly disrupt the outcomes.
Computational Complexity
Linear regression is often not computationally expensive compared to decision trees and clustering algorithms. For N training examples and X features, the order of complexity usually falls around O(X²), O(XN), or O(X³), depending on the solution method.
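One way to see where such orders come from is the closed-form normal-equation fit, sketched below with NumPy: forming XᵀX costs on the order of N·X² operations, and solving the resulting X-by-X linear system costs on the order of X³ (the data and variable names here are illustrative):

```python
import numpy as np

# Normal-equation fit: beta solves (X^T X) beta = X^T y.
# Forming X^T X is O(N * F^2) for N samples and F features;
# solving the F-by-F system is O(F^3).

rng = np.random.default_rng(0)
N, F = 200, 3
X = rng.normal(size=(N, F))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta            # noiseless, so the fit should recover true_beta

XtX = X.T @ X                          # O(N * F^2)
beta = np.linalg.solve(XtX, X.T @ y)   # O(F^3)
print(beta)
```

Because N is usually much larger than the number of features, the O(XN)-style term tends to dominate, which is why linear regression stays cheap even on fairly large datasets.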
Comprehensible and Transparent
Linear models are easily comprehensible and transparent in nature. They can be represented in simple mathematical notation and explained to almost anyone.
Decision Trees Selection Criteria
Decision trees are a method for classifying subjects into known groups. They're a form of supervised learning.
Decision trees can further be classified as “eager learners”: they first build a classification model on the training dataset and then actually classify the test dataset. This eagerness to learn a model up front and classify unseen observations is why they are called “eager learners.”
Classification and Regression Capabilities
Decision trees are compatible with both types of tasks — regression as well as classification.
Computational Efficiency
Since a trained decision tree is an in-memory classification model, classification does not incur high computational costs: there is no need for frequent database lookups.
Arbitrary Complicated Decision Boundaries
Decision trees cannot easily model arbitrarily complicated decision boundaries. Each split is axis-aligned, so smooth or diagonal boundaries can only be approximated by many small rectangular steps.
Comprehensible and Transparent
They are extensively used by banks for loan approvals precisely because of the transparency of their rule-based decision-making.
Data Quality
Decision trees can handle datasets with a high degree of errors and missing values.
Incremental Learning
Decision trees work in batches, modeling one group of training observations at a time. Hence, they are unfit for incremental learning.
Error Rates
They have relatively high error rates, though not as high as linear regression.
Data Compatibilities
Decision trees can handle data with both numeric and nominal input attributes.
Assumptions
Decision trees are well-known for making no assumptions about spatial distribution or the classifier’s structure.
Impact of Number of Attributes
These algorithms tend to produce poor results when complex, humanly intangible factors are present. In cases like customer segmentation, for example, it would be very hard for a decision tree to return accurate segments.
Clustering Algorithms Selection Criteria
Clustering algorithms are generally used to find out how subjects are similar on a number of different variables. They're a form of unsupervised learning.
Clustering algorithms, however, aren’t eager learners; they learn directly from the training instances and start processing data only after they are given a test observation to classify.
Classification and Regression Capabilities
Clustering algorithms cannot be used for regression tasks.
Data Handling Capabilities
Clustering can handle most types of datasets and ignore missing values.
Dataset Quality
They work well with both continuous and categorical (factor) data values.
Comprehensible and Transparent
Unlike decision trees, clustering algorithms often don’t bring in the same level of comprehension and transparency. Often, they require a lot of implementation-level explanations for decision-makers.
Computational Efficiency
Clustering algorithms often require frequent database lookups. Hence, they can often be computationally expensive.
Arbitrary Complicated Decision Boundaries
Because of instance-based learning, a fine-tuned clustering algorithm can easily incorporate arbitrarily complex decision boundaries.
Incremental Learning
Clustering naturally supports incremental learning and is the preferred choice over both linear regression and decision trees when new observations arrive continuously.
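A minimal sketch of what incremental learning looks like here: an online centroid update (a standard running-mean update, not tied to any particular library) absorbs one observation at a time without retraining on the whole dataset. All numbers are invented for illustration:

```python
# Online (incremental) centroid update: each new point nudges its nearest
# centroid toward it, weighted by how many points that centroid has seen.
# No pass over the full dataset is ever needed.

centroids = [10.0, 100.0]
counts = [1, 1]  # points seen per centroid so far (one initial point each)

def absorb(x):
    i = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
    counts[i] += 1
    # running-mean update: c += (x - c) / n
    centroids[i] += (x - centroids[i]) / counts[i]

for x in [12, 95, 14, 110, 9]:
    absorb(x)

print(centroids)  # each centroid has drifted toward its incoming points
```

Contrast this with a decision tree, which would have to be rebuilt from scratch (or heavily restructured) to account for the five new observations.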
Error Rates
Clustering test error rates are close to those of Bayesian classifiers.
Impact of Number of Attributes
Thanks to their ability to handle complex, arbitrary boundaries, and unlike decision trees, clustering algorithms can handle multiple attributes and complex interactions between them.
I hope this helps you get started with these algorithms!