AdaBoost is short for Adaptive Boosting. It is basically a machine learning algorithm that is used as a classifier. Whenever you have a large amount of data and you want divide it into different categories, we need a good classification algorithm to do it. We usually use AdaBoost in conjunction with other learning algorithms to improve their performance. Hence the word ‘boosting’, as in it boosts other algorithms! Boosting is a general method for improving the accuracy of any given learning algorithm. So obviously, adaptive boosting refers to a boosting algorithm that can adjust itself to changing scenarios. But why do those algorithms need AdaBoost in the first place? Can AdaBoost function by itself?
Why do we need it?
Let’s say you are given the task of listing out all the people in the city of San Francisco who are taller than 5’7″, weigh less than 190 lbs, and are between the ages of 28 and 41. Now the problem is that you are supposed to do this without the help of machines. All you are allowed to do is take a look at the person and determine whether or not that person qualifies. How do we do it?
You may or may not be good at estimating these parameters just by looking at the person. So to improve the accuracy, you get three people to help you out. The first person is really good at guessing the height, the second person is really good at guessing the weight and the third person is really good at guessing the age. Individually, they may not be all that useful to you, because they can do only one simple task. But if you combine them together and filter out all the people, you have a very good chance of getting an accurate list of people who qualify. Wouldn’t you agree? This is the concept behind AdaBoost.
Yet another example
Let’s say you are a horse-racing gambler and your aim is to maximize your winnings. You decide to create a computer program that will accurately predict the winner of a horse race based on the usual information. This information can be the number of races recently won by each horse, weather conditions, betting odds for each horse, etc. Your program will take all these parameters into account and predict the winner for each race. To create such a program, you will ask a highly successful gambler to explain his betting strategy. But somehow, the expert is unable to articulate a grand set of rules for selecting a horse. He just considers all these factors subconsciously and goes with the best guess.
When you present all the data for a specific set of races, the expert can easily come up with a rule of thumb for that set of races. For example, he can ask you to bet on the horse that has recently won the most races, or may be bet on the horse with the most favored odds. Again, such a rule of thumb is obviously very rough and inaccurate by itself. It is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. If you keep asking the expert’s opinion on different collections of races, you can extract many rules of thumb. Now if your program takes all these rules of thumb into account, there is a much better chance of coming up with a good predictor. Right?
What exactly is AdaBoost?
AdaBoost was formulated by Yoav Freund and Robert Schapire. Problems in machine learning tend to suffer from the curse of dimensionality. What it means is that a single feature point ends up having a huge dimensionality. As in, the feature vector used to describe a particular thing can be very huge. So if each sample consists of a huge number of potential features, the overall system becomes very slow. This is the reason we cannot use a powerful model with a full feature set because it cannot run in real time. We can only afford to have simple machine learning algorithms. But if the algorithms are too simple, they tend to be less accurate. They are called ‘weak learners’. So we cascade them together to create a strong classifier. The output of ‘weak learners’ is combined into a weighted sum that represents the final output of the boosted classifier.
How is it “adaptive”?
AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. When we collect the set of weak learners, we give equal weights to all of them. But on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set. If you consider our earlier example of classifying people in San Francisco, it’s like saying that you trust all three people equally. But with each passing day, you realize that the second person is more accurate than the other two. So you give more weightage to his guesses and less weightage to the other two. This way, the overall accuracy will increase. Basically, the weak learner’s job is the find a weak hypothesis that is appropriate for the current distribution of the data. If you want to learn more about the underlying formulation, just google it and you will find a lot of research papers.