cache the output probabilities of all states in memory
speeds up re-estimation significantly, since these values are accessed multiple times
xRT (real-time factor) decreased from 6.0 to 3.6 when training monophone models with 8 Gaussian mixture components
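
The caching idea above can be sketched as follows. This is only an illustration under assumptions, not the implementation behind the numbers quoted: the names `log_gmm` and `OutputProbCache` are hypothetical, each state is assumed to be a diagonal-covariance Gaussian mixture given as a `(weights, means, variances)` tuple, and the table is filled eagerly for every frame and state of one utterance.

```python
import math


def log_gmm(obs, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(obs, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        comps.append(ll)
    # Log-sum-exp over mixture components for numerical stability.
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))


class OutputProbCache:
    """Hypothetical helper: evaluates every state's output log-probability
    for every frame once and keeps the results in memory."""

    def __init__(self, states, frames):
        # table[t][j] holds log b_j(o_t) for frame t and state j.
        self.table = [[log_gmm(o, *s) for s in states] for o in frames]
        # Number of GMM evaluations actually performed.
        self.evaluations = len(frames) * len(states)

    def __call__(self, t, j):
        # Plain table lookup; no Gaussian is re-evaluated here.
        return self.table[t][j]


# Toy example: two states (1 and 2 mixture components), three 1-D frames.
states = [
    ([1.0], [[0.0]], [[1.0]]),
    ([0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]]),
]
frames = [[0.0], [0.5], [-2.0]]

cache = OutputProbCache(states, frames)
# The forward pass, backward pass, and occupancy accumulation can all
# read cache(t, j) instead of calling log_gmm again.
print(cache(0, 0), cache(2, 1))
```

The trade-off is memory: the table needs one value per frame per state, which is modest for monophone systems but grows with the number of distinct states.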