I have built a decision tree in Python with scikit-learn. When I display the fitted tree and the 20 most important features, feature “A” is the most important and is also used as the root node.
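For reference, here is a minimal sketch of the kind of setup I mean (not my exact code; `X`, `y` and `feature_names` are placeholders for my data, and I assume `criterion="entropy"` so the split criterion matches information gain):

```python
from sklearn.tree import DecisionTreeClassifier

# Sketch only: X, y and feature_names stand in for the real data.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# The root node's split feature is the first entry of tree_.feature.
print("root splits on:", feature_names[clf.tree_.feature[0]])

# The 20 most important features by impurity-based importance.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:20]:
    print(name, importance)
```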
However, when I calculate the information gain per feature and display the results as a list, feature “B” has the highest information gain (feature “A” also has a fairly high information gain, but not as high as feature B). Nevertheless, feature A is used as the root node. So my question is: have I made a programming error, or is this a possible scenario, i.e., is the feature with the highest information gain not necessarily used as the root node?
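And this is roughly the per-feature calculation I mean by “information gain per feature” (a sketch that treats every feature as categorical and splits on each distinct value; note that scikit-learn itself uses binary threshold splits, so the two rankings are not necessarily computed the same way):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(column, labels):
    """Parent entropy minus the weighted entropy of the per-value partitions."""
    weighted = sum((column == v).mean() * entropy(labels[column == v])
                   for v in np.unique(column))
    return entropy(labels) - weighted

# Rank every feature by its information gain at the root, as in my list.
gains = sorted(((name, information_gain(X[:, i], y))
                for i, name in enumerate(feature_names)),
               key=lambda pair: pair[1], reverse=True)
for name, gain in gains:
    print(name, gain)
```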
In another topic someone wrote the following:
For a decision tree that uses Information Gain, the algorithm chooses the attribute that provides the greatest Information Gain (this is also the attribute that causes the greatest reduction in entropy).
and also (this part in particular is really interesting):
Decision Tree algorithms are “greedy” in the sense that they always select the attribute that yields the greatest Information Gain for the current node (branch) being considered, without later reconsidering the attribute after subsequent child branches are added. So to answer your second question: the decision tree algorithm tries to place attributes with the greatest Information Gain near the base of the tree. Note that due to the algorithm’s greedy behavior, the decision tree algorithm will not necessarily yield a tree that provides the maximum possible overall reduction in entropy.
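To check that I understand the greedy rule, here is a toy example with made-up data (reusing the `information_gain` helper from the sketch above): feature 0 predicts the label perfectly, so a gain-based algorithm should pick it as the root.

```python
import numpy as np

# Hypothetical data: feature 0 matches the label exactly, feature 1 is noise.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [0, 0], [1, 1], [1, 0], [0, 1]])
y_toy = np.array([0, 0, 1, 1, 0, 1, 1, 0])

gains = [information_gain(X_toy[:, i], y_toy) for i in range(X_toy.shape[1])]
for i, g in enumerate(gains):
    print("feature", i, "gain:", g)          # feature 0: 1.0, feature 1: 0.0
print("greedy root choice:", int(np.argmax(gains)))  # feature 0
```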
So in that case there is no reason it would select feature A instead of feature B, which means that I probably made a coding mistake..?