In this project, we have evaluated our prototype recognition system on three tasks:
  • degradations in performance on continuous digit strings (TIDigits) due to speech compression;

  • real-time recognition on DARPA's Resource Management Task;

  • the Naval Research Lab's (NRL) SPeech In Noisy Environments (SPINE) evaluation.
For the first task, we evaluated our baseline system on coded speech generated by the Mixed Excitation Linear Prediction (MELP) and Continuously Variable Slope Delta Modulation (CVSD) algorithms. The baseline system achieved a word error rate (WER) of 0.4% on clean speech. MELP-coded speech produced a WER of 0.7%, while CVSD-coded speech produced a WER of 2.1%. Although the relative increase in error rate is considerable for both algorithms (from 0.4% to 0.7% and from 0.4% to 2.1%, respectively), the absolute performance is still respectable. Performance is summarized below in Table 1.

Results of TIDigits Experiments on Coded Speech

Experiment  Condition   Training  Test     Word Error Rate
Number                  Data      Data     Word Models  Xwrd CD Models
1           Matched     STUDIO    STUDIO    0.4%         0.6%
2           Matched     MELP      MELP      0.7%         0.8%
3           Matched     CVSD      CVSD      2.1%         2.0%
4           Mismatched  STUDIO    MELP      1.1%         1.1%
5           Mismatched  STUDIO    CVSD     14.6%        14.2%
6           Mismatched  MELP      CVSD     22.2%        20.2%
7           Mismatched  CVSD      MELP     36.1%        48.1%
Table 1 - Results of evaluations of 8 kHz TIDigits data processed through several speech compression algorithms. These results were generated using a 16-mixture Gaussian HMM system with both word models and cross-word (Xwrd) context-dependent (CD) triphone models. The two highlighted cells demonstrate that recognition performance is fairly robust to the MELP coding algorithm (or, equivalently, that MELP does a much better job than CVSD at preserving the important aspects of the signal).
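The relative increases in error rate discussed for Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the WER values are copied from the word-model column of Table 1, while the names `WER_WORD_MODELS` and `relative_increase` are hypothetical and are not part of the recognition system.

```python
# WER values (in percent) for word models, copied from Table 1.
# Keys are (training data, test data) pairs.
WER_WORD_MODELS = {
    ("STUDIO", "STUDIO"): 0.4,
    ("MELP", "MELP"): 0.7,
    ("CVSD", "CVSD"): 2.1,
    ("STUDIO", "MELP"): 1.1,
    ("STUDIO", "CVSD"): 14.6,
}


def relative_increase(baseline_wer, degraded_wer):
    """Return the relative WER increase over the baseline, in percent."""
    return 100.0 * (degraded_wer - baseline_wer) / baseline_wer


# The clean-speech (STUDIO/STUDIO) result serves as the reference point.
clean = WER_WORD_MODELS[("STUDIO", "STUDIO")]

for (train, test), wer in WER_WORD_MODELS.items():
    if (train, test) == ("STUDIO", "STUDIO"):
        continue
    change = relative_increase(clean, wer)
    print(f"train={train:6s} test={test:6s} WER={wer:5.1f}%  relative increase={change:7.1f}%")
```

Under these figures, matched MELP is roughly 75% worse than clean speech in relative terms and matched CVSD roughly 425% worse, even though both absolute error rates remain low.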

Next, we developed a real-time Resource Management recognition system that delivers a WER of 5.0% at 1.1 times real time (xRT) on a 600 MHz processor. The baseline recognition system for this application, which runs at about 10 xRT, has a WER of 3.4%.

Results of Resource Management Experiments

System            Word Error Rate                          Memory   Runtime
                  Feb89   Oct89   Feb91   Sep92   Average  (Mbyte)  (xRT)

Deliverables
  Baseline        2.9%    3.4%    2.2%    5.2%    3.4%     111      9.7
  Real-time       4.3%    4.8%    3.1%    7.7%    5.0%     46       1.1

LM_Scale
  12.0            4.1%    4.5%    3.4%    8.1%    5.0%     -        -
  10.0            3.7%    4.6%    3.3%    6.9%    4.6%     -        -
  6.0             3.6%    4.3%    3.1%    7.1%    4.5%     -        -

State-Tying
  1378            3.6%    4.3%    3.1%    7.1%    4.5%     -        -
  1946            2.9%    4.4%    3.1%    5.5%    4.0%     -        -
  3554            2.8%    3.5%    2.6%    5.2%    3.5%     -        -

Beam Pruning
  250 200 200     3.9%    4.1%    2.7%    6.4%    4.3%     109      5.2
  200 150 150     4.2%    4.5%    2.9%    7.7%    4.8%     49       1.3
  190 160 100     4.3%    4.8%    3.1%    7.7%    5.0%     46       1.1
Table 2 - Results of our experiments with real-time systems on DARPA's Resource Management task. These results were generated on data sampled at 16 kHz using a 6-mixture Gaussian HMM system with cross-word triphone models and a bigram language model. The first two highlighted rows summarize the performance of the deliverables: a baseline system and a real-time system.
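The trade-off between the two deliverables in Table 2 can be quantified as follows. This is a minimal sketch under stated assumptions: the numbers are copied from the Baseline and Real-time rows, the Average column is assumed to be an unweighted mean of the four test sets (which matches the tabulated values after rounding), and xRT is taken in its usual sense of processing time divided by audio duration. All function and variable names are hypothetical.

```python
from statistics import mean

# Rows copied from Table 2: per-test-set WER (%) for Feb89, Oct89, Feb91, Sep92,
# plus memory (Mbyte) and runtime (xRT).
BASELINE = {"wer": [2.9, 3.4, 2.2, 5.2], "memory_mb": 111, "xrt": 9.7}
REAL_TIME = {"wer": [4.3, 4.8, 3.1, 7.7], "memory_mb": 46, "xrt": 1.1}


def real_time_factor(processing_seconds, audio_seconds):
    """Usual definition of xRT: time spent decoding per second of audio."""
    return processing_seconds / audio_seconds


def summarize(baseline, fast):
    """Compare a faster system against a baseline on speed, memory, and WER."""
    # Assumption: the Average column is an unweighted mean over the test sets.
    base_wer = mean(baseline["wer"])
    fast_wer = mean(fast["wer"])
    return {
        "speedup": baseline["xrt"] / fast["xrt"],
        "memory_saving_pct": 100.0 * (1 - fast["memory_mb"] / baseline["memory_mb"]),
        "relative_wer_increase_pct": 100.0 * (fast_wer - base_wer) / base_wer,
    }


summary = summarize(BASELINE, REAL_TIME)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With these inputs, the real-time system is roughly 8.8 times faster and uses roughly 59% less memory than the baseline, at the cost of a relative WER increase of roughly 45%.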

In November, we will release results on NRL's Speech in Noisy Environments (SPINE) evaluation.