How to Choose the Right ML Algorithms

By Ajitesh Kumar · Oct. 05, 18 · Opinion

In this post, you will learn tips and techniques for choosing the right Machine Learning algorithms for your Machine Learning problem. These can be useful for data scientists or ML researchers who are starting to learn data science and Machine Learning topics.

The choice among different classes of Machine Learning algorithms for training a model can be guided by the following:

  • Availability of data
  • Number of features

This post covers the following scenarios and explains which Machine Learning algorithms can be used in each of them:

  • Large number of features, small volume of data
  • Small number of features, large volume of data
  • Large number of features, large volume of data

Large Number of Features, Small Volume of Data

When there are many features but only a small volume of data, consider some of the following Machine Learning algorithms:

  • Stepwise methods
  • Lasso regression analysis
  • Support vector machine (SVM)

A large number of features relative to the number of samples generally results in overfitting of the models. Thus, one of the key exercises in such a scenario is to do one or both of the following:

  • Remove less important features, for example by using feature selection techniques.
  • Apply L1 or L2 regularization to penalize the weights associated with each feature. L1 (lasso) regularization can drive some weights to exactly zero, so it also acts as a form of feature selection.
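
To illustrate how L1 regularization prunes features when features outnumber samples, here is a minimal sketch of lasso regression fitted by coordinate descent in plain Python. The synthetic data, the `alpha` value, and the function names are assumptions made for this illustration and do not come from the original post; in practice one would more likely use a library implementation.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)
            ) / n
            new_wj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_wj
    return w


# Synthetic data (hypothetical): 20 samples, 50 features,
# and only features 0 and 1 actually influence the target.
rng = random.Random(0)
n_samples, n_features = 20, 50
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("features with non-zero weight:", selected)
```

Raising `alpha` zeroes out more weights, and a large enough value zeroes out all of them, so the penalty strength is the knob that trades fit against sparsity.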

One example with a large number of features but a small volume of data is protein-to-protein interactions, where the number of features can be on the order of millions while the sample size is on the order of thousands.

Small Number of Features, Large Volume of Data

When there are few features but a large volume of data, consider some of the following Machine Learning algorithms:

  • Generalized linear models (GLM)
  • Ensemble methods such as bagging, random forest, and boosting (for example, AdaBoost)
  • Deep learning
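
With few features and many rows, even a simple generalized linear model can be trained cheaply one example at a time. The following is a minimal sketch of logistic regression (a GLM for binary outcomes) fitted by stochastic gradient descent in plain Python. The synthetic data, the learning rate, the number of epochs, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_sgd(X, y, lr=0.1, epochs=5, seed=0):
    """Fit weights and intercept by stochastic gradient descent on log-loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, X[i])))
            err = p - y[i]  # gradient of log-loss w.r.t. the linear predictor
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
    return w, b


def predict(w, b, row):
    """Return class 1 when the predicted probability is at least 0.5."""
    score = b + sum(wj * xj for wj, xj in zip(w, row))
    return 1 if sigmoid(score) >= 0.5 else 0


# Synthetic data (hypothetical): 5,000 rows, 2 features,
# with the label determined by the sign of x0 + x1.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(5000)]
y = [1 if row[0] + row[1] > 0 else 0 for row in X]

w, b = fit_logistic_sgd(X, y)
accuracy = sum(predict(w, b, row) == label for row, label in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.3f}")
```

Because each update touches a single row, the cost per epoch grows linearly with the number of rows, which is what makes this family of models practical on large data sets.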

Examples of large data include microarrays (gene expression data), proteomics, brain images, videos, functional data, longitudinal data, high-frequency financial data, and warehouse sales, among others.

Large Number of Features, Large Volume of Data

When both the number of features and the volume of data are large, the primary concern becomes the computational cost of processing the data and of training and testing the models. The following techniques can be used to handle a large number of features and the associated data set while building the models:

  • Random projections: A technique for reducing the dimensionality of a set of points in Euclidean space by projecting them onto a random lower-dimensional subspace in a way that approximately preserves the distances between points. Unlike feature selection, the resulting dimensions are combinations of the original features rather than a subset of them.
  • Variable screening: Variable screening methods are used to select the most important features out of all the available ones.
  • Subsampling: With a large data set, computational cost is saved by fitting the model to a subsample and then making a simple correction to obtain an estimate for the original data set. A problem arises when the subsampling fails to take class imbalance in the data set into account; when imbalance is handled properly, subsampling can deliver significant computational cost savings. Imbalanced class data sets come in the following forms:
    • Marginal imbalance: One or more of the classes is very rare. For example, for every thousand positive examples, there are only a couple of negative examples.
    • Conditional imbalance: For most values of the input features, the class is easy to predict accurately, while for the remaining values it is not.
  • Case-control sampling: The technique used to deal with the imbalanced class problem. Case-control sampling gathers a uniform sample from each class while adjusting the mixture of the classes. It reduces the complexity of training a logistic regression classifier by selecting a small subsample of the original data set for training. A logistic regression model fitted on the subsample can be converted to a valid model for the original population via a simple adjustment to the intercept. Standard case-control sampling still may not make the most efficient use of the data: it fails to efficiently exploit conditional imbalance in a data set that is marginally balanced.
  • MapReduce
  • Divide and conquer
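
The intercept adjustment mentioned under case-control sampling can be sketched as follows. The commonly stated correction adds the population log-odds and subtracts the subsample log-odds from the intercept fitted on the subsample. The function names and the synthetic data are assumptions for illustration, and the demonstration uses an intercept-only model (whose fitted intercept is simply the log-odds of the subsample) so that the result can be checked by hand.

```python
import math
import random


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def case_control_sample(X, y, n_per_class, seed=0):
    """Draw the same number of rows from each class (labels are 0 or 1)."""
    rng = random.Random(seed)
    picked = []
    for label in (0, 1):
        idx = [i for i, yi in enumerate(y) if yi == label]
        picked.extend(rng.sample(idx, min(n_per_class, len(idx))))
    return [X[i] for i in picked], [y[i] for i in picked]


def corrected_intercept(sub_intercept, population_rate, subsample_rate):
    """Shift an intercept fitted on the subsample back to the population scale."""
    return sub_intercept + logit(population_rate) - logit(subsample_rate)


# Synthetic data (hypothetical): 10,000 rows with 2% positives.
y = [1] * 200 + [0] * 9800
X = [[float(i)] for i in range(len(y))]

X_sub, y_sub = case_control_sample(X, y, n_per_class=200)
population_rate = sum(y) / len(y)          # 0.02
subsample_rate = sum(y_sub) / len(y_sub)   # 0.5 after balancing

# An intercept-only logistic model fitted on the subsample has intercept
# logit(subsample_rate); the correction should recover logit(population_rate).
sub_intercept = logit(subsample_rate)
adjusted = corrected_intercept(sub_intercept, population_rate, subsample_rate)
print("corrected intercept:", adjusted)
```

Only the intercept changes; the slope coefficients fitted on the balanced subsample are kept as they are, which is why the model can be trained on 400 rows here instead of 10,000.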

Once the large number of features or the large volume of data has been dealt with, one can apply the appropriate algorithms as described above.

References

  • Local case-control sampling — Efficient subsampling in imbalanced data sets
  • Data science presentation by Trevor Hastie

Summary

In this post, you learned about criteria for selecting Machine Learning algorithms, and about appropriate data processing techniques, based on the number of features and the volume of data. For a large number of features and a small volume of data, consider algorithms such as SVM, lasso regression, and stepwise methods. For a small number of features and a large volume of data, consider GLM, deep learning algorithms, and ensemble methods. When both the number of features and the volume of data are large, first bring the number of features down to the most important ones, and second, use subsampling techniques to save computational cost. One can then apply the appropriate ML algorithms as described in this post.

Published at DZone with permission of Ajitesh Kumar, DZone MVB. See the original article here.
