This page presents three LDM experiments evaluating the potential of LDMs for speech classification.
Setup:
1) 13-dimensional observation Y[] and 20-dimensional state X[]
2) Features: 12 MFCCs + energy
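As a concrete sketch of the setup above: an LDM generates each 13-dimensional observation y_t (12 MFCCs + energy) from a 20-dimensional hidden state x_t via x_t = F x_{t-1} + w_t, y_t = H x_t + v_t. The parameter values below are placeholders for illustration, not the trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ldm(F, H, Q, R, m0, T):
    """Draw T observations from the LDM
       x_t = F x_{t-1} + w_t,  w_t ~ N(0, Q)
       y_t = H x_t + v_t,      v_t ~ N(0, R)
    starting from state mean m0."""
    d_x, d_y = F.shape[0], H.shape[0]
    x = m0.copy()
    Y = np.empty((T, d_y))
    for t in range(T):
        x = F @ x + rng.multivariate_normal(np.zeros(d_x), Q)
        Y[t] = H @ x + rng.multivariate_normal(np.zeros(d_y), R)
    return Y

# dimensions matching the setup: 20-dim state, 13-dim observation
d_x, d_y = 20, 13
Y = simulate_ldm(F=np.eye(d_x) * 0.9,
                 H=rng.standard_normal((d_y, d_x)) * 0.1,
                 Q=np.eye(d_x) * 0.01,
                 R=np.eye(d_y) * 0.1,
                 m0=np.zeros(d_x), T=50)
```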
Experiment steps:
(1) Train LDMs using speaker [sap]; test using another speaker, [mc].
(2) Train LDMs using 5 speakers [cm], [if], [mp], [sap], [sp]; test using another speaker, [mc].
(3) Train LDMs using 5 speakers [cm], [if], [mp], [sap], [sp]; test using one of the training speakers, [sap].
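In all three experiments, a test segment is classified by scoring its feature sequence under each phone's trained LDM and picking the phone with the highest likelihood. A minimal sketch of that scoring step (Python/NumPy, using the standard Kalman-filter recursion; all parameter names are placeholders):

```python
import numpy as np

def kalman_loglik(Y, F, H, Q, R, m0, P0):
    """Log-likelihood of observation sequence Y (T x d_y) under the LDM
       x_t = F x_{t-1} + w_t,  w_t ~ N(0, Q)
       y_t = H x_t + v_t,      v_t ~ N(0, R)
    where (m0, P0) is the prior mean/covariance of the first state."""
    m, P = m0.copy(), P0.copy()
    d = Y.shape[1]
    ll = 0.0
    for y in Y:
        # innovation and its covariance
        e = y - H @ m
        S = H @ P @ H.T + R
        _, logdet = np.linalg.slogdet(S)
        ll += -0.5 * (d * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(S, e))
        # measurement update
        K = P @ H.T @ np.linalg.inv(S)
        m = m + K @ e
        P = P - K @ H @ P
        # time update (predict next state)
        m = F @ m
        P = F @ P @ F.T + Q
    return ll

def classify(Y, models):
    """Pick the phone whose LDM gives the highest log-likelihood.
    models: dict phone -> (F, H, Q, R, m0, P0)."""
    return max(models, key=lambda p: kalman_loglik(Y, *models[p]))
```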
Experiment (1): train using speaker [sap], test using speaker [mc]
- It failed to recognize three vowels (/aa/, /ae/, /eh/) and two nasals (/m/, /n/). However, the likelihood values of /m/ and /n/ are now in a reasonable range (the previous experiment gave very bad likelihood values for /m/ and /n/ because the model parameters became disordered during training).
- It successfully classified all three fricatives (/f/, /sh/, /z/). LDM shows great potential for fricative classification.
- The correct classification rate is 3/8 (37.5%), which is reasonable given that the LDMs were trained on only one speaker.
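The correct classification rate is the diagonal mass of the 8-phone confusion matrix. A minimal sketch of how it is tallied (the predicted labels below are illustrative placeholders consistent with the 3/8 result, not the actual experiment output):

```python
import numpy as np

phones = ["aa", "ae", "eh", "m", "n", "f", "sh", "z"]
true = phones
# hypothetical predictions: fricatives correct, vowels and nasals confused
pred = ["ae", "m", "ae", "n", "m", "f", "sh", "z"]

idx = {p: i for i, p in enumerate(phones)}
conf = np.zeros((len(phones), len(phones)), dtype=int)  # rows: true, cols: predicted
for t, p in zip(true, pred):
    conf[idx[t], idx[p]] += 1
correct_rate = np.trace(conf) / conf.sum()  # fraction on the diagonal
```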
Experiment (2): train LDMs using 5 speakers [cm], [if], [mp], [sap], [sp], test using speaker [mc]
- After adding 4 speakers to the training set, the confusion-matrix colormap shows an obvious performance improvement (much closer to diagonal than in the first experiment).
- For vowels, it successfully classified /aa/. /ae/ was confused with /m/ and /n/, but the likelihood values are close. /eh/ was confused with /ae/, but the likelihood values are fairly close.
- The nasal /m/ was confused with /n/, but the likelihood values are very close.
- It successfully classified all three fricatives (/f/, /sh/, /z/).
- The correct classification rate is 4/8 (50%). The rate itself is not good, but performance clearly improves as the training set grows, so we could expect noticeably better performance with a sufficiently large training set.
Experiment (3): train LDMs using 5 speakers [cm], [if], [mp], [sap], [sp], test using one of the training speakers, [sap]
- In this experiment, the test phones were included in training, so performance is good: it recognized almost all the phones (only /n/ was confused with /m/), and the confusion-matrix colormap is close to diagonal.
- The correct classification rate is 7/8 (87.5%).
- If the training set is sufficiently large and the test speaker is well matched to the training speakers, we can expect LDMs to provide very good performance.
--
July 08, 2007 by Tao.