AdaBoost Algorithm For Machine Learning
What is AdaBoost? Let's check out what it is, what it does, and some examples of how to train a model.
What Is AdaBoost?
AdaBoost is short for Adaptive Boosting. It was the first really successful boosting algorithm developed for binary classification, and it is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost is generally used with short decision trees. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight.
Learning: AdaBoost Model
Learn the AdaBoost Model from Data
- AdaBoost is best used to boost the performance of decision trees on binary classification problems.
- AdaBoost was originally called AdaBoost.M1 by its authors. More recently it may be referred to as discrete AdaBoost, because it is used for classification rather than regression.
- AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners, that is, models that achieve accuracy just above random chance.
Each instance in the training dataset is weighted. The initial weight is set to:
weight(xi) = 1/n
where xi is the i-th training instance and n is the number of training instances.
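As a minimal sketch of this initialization, the snippet below builds the starting weights for a small dataset. The function name `initial_weights` is made up for illustration and does not come from any library.

```python
def initial_weights(n):
    """Return n equal weights of 1/n, one per training instance."""
    return [1.0 / n for _ in range(n)]


# Four training instances each start with a weight of 1/4.
weights = initial_weights(4)
print(weights)  # [0.25, 0.25, 0.25, 0.25]
```

Every instance starts out equally important. The weights only begin to differ once the first weak learner has made its mistakes.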
How to Train One Model
A weak classifier is prepared on the training data using the weighted samples. The weak classifier is typically a decision stump, which is a decision tree with a single level. Only binary (two-class) classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class. The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:
error = (N - correct) / N
where error is the misclassification rate, correct is the number of training instances predicted correctly by the model, and N is the total number of training instances.
For example, if the model correctly predicted 78 of 100 training instances, the error would be (100 - 78) / 100, or 0.22.
This is modified to use the weighting of the training instances:
error = sum(w(i) * terror(i)) / sum(w)
which is the weighted sum of the misclassification rate, where w(i) is the weight for training instance i and terror(i) is the prediction error for training instance i, which is 1 if misclassified and 0 if correctly classified.
For example, suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2. If the predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1 and 0. The misclassification rate would be calculated as:
error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)
or
error = 0.704
A stage value is calculated for the trained model, which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:
stage = ln((1-error) / error)
where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight.
The training weights are then updated, giving more weight to incorrectly predicted instances and less weight to correctly predicted instances.
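The worked example above can be checked with a few lines of Python. This is only a sketch: the function names `weighted_error` and `stage_value` are made up for illustration, and the code assumes the error is strictly between 0 and 1 so that the logarithm is defined.

```python
import math


def weighted_error(weights, predicted, actual):
    """Weighted misclassification rate: sum(w(i) * terror(i)) / sum(w)."""
    # terror is 1 for a misclassified instance and 0 for a correct one.
    terrors = [0 if p == y else 1 for p, y in zip(predicted, actual)]
    return sum(w * t for w, t in zip(weights, terrors)) / sum(weights)


def stage_value(error):
    """Stage value ln((1 - error) / error); assumes 0 < error < 1."""
    return math.log((1.0 - error) / error)


# Numbers taken from the worked example in the text.
weights = [0.01, 0.5, 0.2]
predicted = [-1, -1, -1]
actual = [-1, 1, -1]

error = weighted_error(weights, predicted, actual)
stage = stage_value(error)
print(round(error, 3))  # 0.704
print(stage)
```

Note that the error in this example is above 0.5, so the stage value comes out negative (roughly -0.87). A weak learner that does better than chance on the weighted data has an error below 0.5 and therefore a positive stage value.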
The weight of one training instance (w) is updated using:
w = w * exp(stage * terror)
where w is the weight for a specific training instance, exp() is the numerical constant e (Euler's number) raised to a power, stage is the stage value for the weak classifier, and terror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:
terror = 0 if (y == p), otherwise 1
where y is the output variable for the training instance and p is the prediction from the weak learner. This has the effect of not changing the weight if the training instance was classified correctly, and making the weight slightly larger if the weak learner misclassified the instance.
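The update rule can be sketched as follows. The data here is invented for illustration: four equally weighted instances, one of which is misclassified, which gives an error of 0.25 and a stage value of ln(3). The function name `update_weights` is likewise hypothetical.

```python
import math


def update_weights(weights, predicted, actual, stage):
    """Apply w = w * exp(stage * terror) to every training instance."""
    updated = []
    for w, p, y in zip(weights, predicted, actual):
        terror = 0 if p == y else 1  # 1 only when the weak learner was wrong
        updated.append(w * math.exp(stage * terror))
    return updated


# Illustrative data: the second instance is the only misclassified one.
weights = [0.25, 0.25, 0.25, 0.25]
predicted = [1, -1, 1, 1]
actual = [1, 1, 1, 1]
stage = math.log((1 - 0.25) / 0.25)  # error = 0.25, so stage = ln(3)

new_weights = update_weights(weights, predicted, actual, stage)
print(new_weights)
```

The correctly classified instances keep their weight of 0.25, while the misclassified one grows to about 0.75. The weight only grows when the stage value is positive, that is, when the weak learner's error is below 0.5.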
- Weak models are added sequentially, each trained using the weighted training data.
- The process continues until a pre-set number of weak learners has been created.
- Once completed, you are left with a pool of weak learners, each with a stage value.
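Putting the steps above together, a training loop might look like the sketch below. It is one possible reading of the procedure, not a reference implementation: the stump representation (feature index, threshold, polarity), the exhaustive threshold search, the clamp that keeps the error away from 0 and 1, and all function names are assumptions made for illustration.

```python
import math


def stump_predict(stump, row):
    """A stump is (feature, threshold, polarity); it outputs +1.0 or -1.0."""
    feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity


def fit_stump(X, y, weights):
    """Try every feature/threshold/polarity; keep the lowest weighted error."""
    best, best_error = None, float("inf")
    total = sum(weights)
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1.0, -1.0):
                stump = (feature, threshold, polarity)
                wrong = sum(w for row, target, w in zip(X, y, weights)
                            if stump_predict(stump, row) != target)
                if wrong / total < best_error:
                    best, best_error = stump, wrong / total
    return best, best_error


def train_adaboost(X, y, n_learners):
    """Return a list of (stump, stage) pairs, one per weak learner."""
    n = len(X)
    weights = [1.0 / n] * n  # initial weight(xi) = 1/n
    ensemble = []
    for _ in range(n_learners):
        stump, error = fit_stump(X, y, weights)
        # Assumption: clamp the error so ln((1 - error) / error) stays finite.
        error = min(max(error, 1e-10), 1.0 - 1e-10)
        stage = math.log((1.0 - error) / error)
        ensemble.append((stump, stage))
        # w = w * exp(stage * terror), with terror = 1 only for mistakes.
        weights = [
            w * math.exp(stage * (0 if stump_predict(stump, row) == target else 1))
            for row, target, w in zip(X, y, weights)
        ]
    return ensemble


# Toy one-feature dataset that no single stump can classify perfectly.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
for stump, stage in train_adaboost(X, y, n_learners=3):
    print(stump, stage)
```

Each pass fits a new stump against the current weights, so later stumps concentrate on the instances that earlier stumps got wrong. The loop simply stops after `n_learners` rounds, matching the pre-set limit described above.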
Making Predictions with AdaBoost
Predictions are made by calculating the weighted average of the weak classifiers. For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learner's stage value. The prediction for the ensemble model is taken as the sum of the weighted predictions. If the sum is positive, the first class is predicted; if negative, the second class is predicted.
For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0 and -1.0. From a majority vote, it looks like the model will predict a value of 1.0, or the first class. These same 5 weak classifiers may have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0, or the second class.
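The example above can be reproduced with a short sketch. The function name `ensemble_predict` is made up for illustration. The text does not say what happens when the weighted sum is exactly zero, so this sketch assumes that case falls to the second class.

```python
def ensemble_predict(predictions, stages):
    """Return the stage-weighted sum and the resulting class (+1.0 or -1.0)."""
    weighted_sum = sum(s * p for p, s in zip(predictions, stages))
    # Assumption: a sum of exactly zero is assigned to the second class.
    label = 1.0 if weighted_sum > 0 else -1.0
    return weighted_sum, label


# Values taken from the example in the text.
predictions = [1.0, 1.0, -1.0, 1.0, -1.0]
stages = [0.2, 0.5, 0.8, 0.2, 0.9]

total, label = ensemble_predict(predictions, stages)
print(round(total, 1), label)  # -0.8 -1.0
```

Three of the five weak learners vote for the first class, but the two that vote for the second class carry larger stage values, so the weighted sum overrules the simple majority.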
Data Preparation for AdaBoost
This section lists some heuristics for best preparing your data for AdaBoost.
- Quality Data: Because the ensemble method attempts to correct misclassifications in the training data, you need to be careful that the training data is high-quality.
- Outliers: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.
- Noisy Data: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these from your training dataset.
We have studied the AdaBoost boosting algorithm: how a model is trained, how predictions are made, and how to prepare data for it. I hope this article helps you understand the concept of boosting with AdaBoost. If you have any questions, feel free to ask in the comments section.
Published at DZone with permission of Rinu Gour.