Aurora - Overview

State-of-the-art speech recognition systems have achieved low error rates on medium-complexity tasks, such as Wall Street Journal (WSJ), that involve clean data. However, the performance of these systems degrades rapidly as the background noise level increases. With the growing popularity of low-bandwidth miniature communication devices such as cell phones, palm computers, and smart pagers, a much greater demand is being created for robust voice interfaces. Speech recognition systems are now required to perform at a near-zero error rate under various noise conditions. Further, since many of these portable devices use lossy compression to conserve bandwidth, speech recognition performance must also not degrade when subjected to compression, packet loss, and other common wireless communication system artifacts. Speech coding, for example, is known to have a negative effect on the accuracy of speech recognition systems.

Aurora, a working group of ETSI, has been formed to address many of the issues involved in using speech recognition in mobile environments. Aurora's main task is the development of a distributed speech recognition (DSR) system standard that provides a client/server framework for human-computer interaction. In this framework, the client side performs the speech collection and signal processing (feature extraction) using software and hardware collectively termed a front end. The processed data is transmitted to the server for recognition and subsequent processing. The exact form and function of the front end is a design factor in the overall DSR structure. Our collaboration with ETSI focuses on evaluating the performance of different front ends on the WSJ task for a variety of impairments:

Additive Noise: Six noise types collected from street traffic, train terminals and stations, cars, babble, restaurants, and airports are artificially added at varying signal-to-noise ratios (SNRs). For each training utterance, the noise type is randomly chosen from these six conditions, and an SNR between 10 and 20 dB, in steps of 1 dB, is randomly chosen. For each of the six test conditions, the SNR is randomly chosen between 5 and 15 dB in steps of 1 dB. Further, the frequency characteristics of telecommunications equipment are simulated for all test conditions. G.712 filtering is used to simulate the frequency characteristics of the terminals, and P.341 filtering, defined by the ITU, is used to simulate the telecommunications transmission equipment.
Sample Frequency Reduction: Two sampling rates, 16 kHz and 8 kHz, are used. Current telecommunications technology operates at a sampling rate of 8 kHz, but a goal of next-generation technologies is to increase this to 16 kHz in order to improve the quality of the transmitted speech.
Microphone Variation: Two microphone conditions are tested: a Sennheiser microphone and a second microphone used during collection of the WSJ0 data.
Compression: Two feature types, compressed and uncompressed, are used. A compression scheme defined by the DSR front end is used.
Model Mismatch: Two training conditions are tested. One set of models is trained on only clean data with the Sennheiser microphone, and one set is trained on a combination of clean and noisy data under both microphone conditions. Both sets of models are tested against the noisy test data.
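
The additive-noise procedure described above can be illustrated with a minimal sketch. It is not the tool used in the evaluation. It assumes that SNR is defined as the ratio of mean sample powers of speech and noise (the actual evaluation may measure speech level differently), and the names NOISE_TYPES, mean_power, mix_at_snr, pick_training_condition, and pick_test_snr are hypothetical, introduced only for illustration. The telephone-channel filtering (G.712, P.341) is not modeled here.

```python
import math
import random

# Labels for the six noise conditions listed above (names are illustrative).
NOISE_TYPES = ["street", "train", "car", "babble", "restaurant", "airport"]


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add it.

    Assumption: SNR is computed from mean sample power over the whole utterance.
    """
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def pick_training_condition(rng):
    """Training: random noise type, SNR from 10 to 20 dB in 1 dB steps."""
    return rng.choice(NOISE_TYPES), rng.randint(10, 20)


def pick_test_snr(rng):
    """Testing: the noise type is fixed per test condition, SNR from 5 to 15 dB in 1 dB steps."""
    return rng.randint(5, 15)
```

In use, a training utterance would be paired with a noise segment of the type returned by pick_training_condition and mixed with mix_at_snr at the returned SNR. Each test condition would keep its noise type fixed and draw only the SNR.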