SPLITTING CRITERIA
To split the data at a node, we need to find the question
that yields the greatest entropy reduction (i.e., removes
the most uncertainty from the data).
In speech recognition, it can be shown that this amounts to
maximizing the increase in log-likelihood obtained by the split:
dL = L(left child) + L(right child) - L(parent)
These likelihoods can be computed from the state occupancies
accumulated during training (see "decision tree-based state tying"
for a detailed derivation and the important references).
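
The computation can be sketched as follows. This is a minimal illustration, not the derivation referenced above: it assumes each state is summarized by an occupancy count plus the mean and variance of a single diagonal-covariance Gaussian, and that a cluster's log-likelihood is approximated by pooling those statistics. The function names (pooled_stats, cluster_log_likelihood, split_gain) and the sample numbers are hypothetical. The gain is taken as the children's combined log-likelihood minus the parent's, so a useful split gives a positive value.

```python
import math


def pooled_stats(states):
    """Pool (occupancy, mean, variance) triples into one diagonal Gaussian.

    Assumption: each state is a tuple (occ, mean_vector, variance_vector).
    """
    occ = sum(g for g, _, _ in states)
    dim = len(states[0][1])
    mean = [sum(g * m[d] for g, m, _ in states) / occ for d in range(dim)]
    # Pooled variance: occupancy-weighted second moment minus squared pooled mean.
    var = [
        sum(g * (v[d] + m[d] ** 2) for g, m, v in states) / occ - mean[d] ** 2
        for d in range(dim)
    ]
    return occ, mean, var


def cluster_log_likelihood(states):
    """Approximate log-likelihood of the data assigned to a cluster of states.

    Uses L = -0.5 * occ * (dim * (1 + log(2*pi)) + log|Sigma|), which holds
    when the cluster is modeled by one Gaussian fitted to the pooled data.
    """
    occ, _, var = pooled_stats(states)
    dim = len(var)
    log_det = sum(math.log(v) for v in var)
    return -0.5 * occ * (dim * (1.0 + math.log(2.0 * math.pi)) + log_det)


def split_gain(left, right):
    """Log-likelihood increase from splitting the parent into left and right."""
    parent = left + right
    return (
        cluster_log_likelihood(left)
        + cluster_log_likelihood(right)
        - cluster_log_likelihood(parent)
    )


# Hypothetical one-dimensional example: two states with well-separated means.
left = [(10.0, [0.0], [1.0])]
right = [(10.0, [4.0], [1.0])]
gain = split_gain(left, right)
```

With these sample numbers the pooled parent variance is 5, so the gain equals 10 * ln(5), roughly 16.09. Two states with identical statistics would give a gain of zero, meaning the question separates nothing. In a tree-building loop, the question with the largest gain at a node would be the one selected.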