Decision trees occupy an important place in machine learning. They form the basis of Random Forests, which are used extensively in real-world systems; a famous example is Microsoft Kinect, where Random Forests are used to track body parts. The technique is popular because it provides high accuracy with relatively little effort: decision trees are fast to train, cheap to evaluate, and robust to outliers. So how exactly are they constructed? How are the nodes in the tree generated so that they are optimal?
What are decision trees?
Decision trees are machine learning models that take the form of tree structures. Each non-leaf node in the tree is essentially a decision maker; these are called decision nodes. Each decision node carries out a specific test to determine where to go next: depending on the outcome, you follow either the left branch or the right branch. We keep doing this until we reach a leaf node. If we are constructing a classifier, each leaf node represents a class. Let’s say you are trying to determine whether or not it’s going to rain tomorrow based on three factors available today: temperature, pressure, and wind. A typical decision tree would look like this:
Bear in mind that the values used in this decision tree are completely arbitrary and are not indicative of actual weather conditions. But you get the idea, right? Okay, let’s go to the next step.
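To make the structure concrete, here is a minimal sketch of such a tree written as nested conditionals. The function name, split order, and thresholds are all invented for illustration, just like the values in the figure:

```python
def will_it_rain(temperature, pressure, wind):
    """Toy decision tree predicting rain from today's readings.

    Every threshold here is made up for illustration only.
    """
    if pressure < 1000:              # decision node: low pressure?
        if wind > 20:                # decision node: strong wind on top of it
            return True              # leaf: predict rain
        return temperature < 15      # leaf test: cool and low pressure -> rain
    else:
        return False                 # leaf: high pressure, assume dry

print(will_it_rain(temperature=12, pressure=990, wind=25))   # True
print(will_it_rain(temperature=20, pressure=1020, wind=5))   # False
```

Every path from the root to a leaf is just a sequence of these tests, ending in a class label.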
How can we automatically construct this tree?
This is the tricky part! Understanding the concept of decision trees is fairly simple, but how do we construct the optimal tree? What attribute should be at the root node? How do we decide the thresholds? To understand this, we must first discuss the concept of entropy.
Entropy is basically a measure of uncertainty. As we move from the top of the tree towards the bottom, we go from complete uncertainty to complete certainty: at the root, we don’t know anything about the input datapoint; by the time we reach a leaf node, we know exactly what class the datapoint belongs to. So a good way to construct the decision tree is to reduce the uncertainty at each level. In other words, we need to reduce the entropy with each passing level of the tree.
How to measure the entropy?
In order to reduce the entropy, we first need to know how to measure it. Entropy is defined as:
entropy(X) = -∑ᵢ pᵢ log₂(pᵢ)

where pᵢ is the probability of symbol i occurring in the dataset, and the sum runs over all the distinct symbols in the dataset.
To get an idea about its behavior, let’s consider a dataset consisting of 60 items. In this dataset, there are 3 types of items. Now if each item appears 20 times in this dataset, the entropy is high because there is a lot of uncertainty. As in, if you randomly pick an item, there is an equal chance of getting any of the three items. On the other hand, if the first item appears 58 times and the remaining two items appear 1 time each, then the uncertainty is really low. If you randomly pick a sample, you can say with some certainty that it’s going to be the first item. So in order to reduce the entropy, we need the dataset to be skewed in some way.
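This behavior is easy to check numerically. Here is a small sketch of an entropy helper (not a library function) applied to the uniform and the skewed datasets described above:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a dataset given per-item counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip absent items: 0*log(0) -> 0
    return -sum(p * log2(p) for p in probs)

print(round(entropy([20, 20, 20]), 3))  # uniform: maximal uncertainty, log2(3) ≈ 1.585
print(round(entropy([58, 1, 1]), 3))    # skewed: low uncertainty, ≈ 0.244
```

The uniform split hits the maximum possible entropy for three symbols, while the heavily skewed dataset is close to zero.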
How to reduce the entropy in the decision tree?
Let’s consider another dataset of 60 items, where the first item appears 14 times, the second item appears 27 times, and the third item appears 19 times. Let’s measure the entropy (all logarithms are base 2):
entropy = -(14/60) log₂(14/60) - (27/60) log₂(27/60) - (19/60) log₂(19/60) ≈ 1.534
In a decision tree, we need to split this dataset into two parts in a way that reduces the overall entropy. Suppose a particular feature test produces this split:

left subtree = [5, 21, 11]
right subtree = [9, 6, 8]
Let’s measure the entropy:
left entropy = -(5/37) log₂(5/37) - (21/37) log₂(21/37) - (11/37) log₂(11/37) ≈ 1.374
right entropy = -(9/23) log₂(9/23) - (6/23) log₂(6/23) - (8/23) log₂(8/23) ≈ 1.565
In order to measure the overall entropy of the two subtrees, we take a weighted sum based on the number of items in each subtree. The left subtree has 37 items and the right subtree has 23:
overall entropy = (37/60)*1.374 + (23/60)*1.565 = 1.447
Let’s compute the difference between the initial entropy and the final entropy we just computed. This is referred to as “information gain”:
information gain = 1.534 - 1.447 ≈ 0.09
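The whole worked example can be verified in a few lines of Python (a sketch using the same counts as above):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a dataset given per-item counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

parent = [14, 27, 19]                  # the dataset before the split
left, right = [5, 21, 11], [9, 6, 8]   # the two subtrees

h_parent = entropy(parent)             # ≈ 1.534
h_split = (sum(left) * entropy(left) + sum(right) * entropy(right)) / sum(parent)  # ≈ 1.447
gain = h_parent - h_split
print(round(gain, 2))                  # 0.09
```

Without rounding the intermediate entropies, the gain comes out to about 0.086, which rounds to the 0.09 quoted in the text.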
So this tells us that if we split the dataset in this particular way, we obtain an information gain of about 0.09. At each node, we compute this for every feature in the feature vector (and every candidate threshold) and greedily pick the split that gives the most information gain. This is how we decide which attribute goes at the root of the decision tree and what comes after it. Each feature splits the dataset in a different way, so we need to check all of them to pick the best one. Each of these computations is cheap, which is why decision trees are quick to train.
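A minimal sketch of this greedy search might look like the following; the function names are invented, and real implementations (e.g. scikit-learn) add many refinements such as stopping criteria and efficient threshold scanning:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (feature, threshold) pair; keep the most information gain."""
    parent = entropy(labels)
    best = (0.0, None, None)           # (gain, feature index, threshold)
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] < threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] >= threshold]
            if not left or not right:  # degenerate split, nothing to gain
                continue
            h_split = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            gain = parent - h_split
            if gain > best[0]:
                best = (gain, f, threshold)
    return best

# toy data: feature 0 separates the classes perfectly, feature 1 is useless
rows = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.0], [9.0, 4.0]]
labels = ["no", "no", "yes", "yes"]
print(best_split(rows, labels))  # (1.0, 0, 8.0): a pure split on feature 0
```

This is exactly the per-node decision: the chosen (feature, threshold) pair becomes a decision node, and the procedure recurses on the two resulting subsets until the leaves are pure enough.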