Why is conversational speech recognition much more difficult than read
speech (such as Broadcast News, Wall Street Journal, etc)? Simply put,
there are many more factors that come into play when humans converse
naturally than when a person is reading prepared text. Added to this
is the effect of telephone bandwidth and line noise. To demonstrate
this, we have prepared a set of examples illustrating the varying difficulty levels of speech recognition tasks.
The main factors contributing to the difficulty of SWITCHBOARD
recognition are:
- Conversational Speaking Style: The language model is affected by disfluencies, by syntactic and discourse patterns, and by the linguistic incoherence introduced by automatic segmentation. False starts, interruptions, and bad grammar are particularly difficult. (A toy perplexity sketch after this list shows the effect of disfluencies on the language model.)
- Pronunciation Effects: The speaking-rate variability and reduced pronunciations of conversational speech are difficult enough for human listeners, and they only compound the already difficult task of automatic speech recognition. A key issue is the poor correspondence between pauses and phrase boundaries. (A lexicon sketch after this list shows how reduced forms multiply pronunciation variants.)
- Telephone Bandwidth and Ambient Noise: The limited bandwidth of the telephone channel degrades the original speech signal, making acoustic discrimination a challenge. Crosstalk, background speech, and ambient noise such as music and television all increase the error rate of automatic recognition. Remarkably, humans do a very good job of filtering out these types of noise, while our current statistical modeling techniques fall far short. (A band-limiting sketch after this list simulates the channel degradation.)
- Training Data: A wealth of accurately transcribed training data is necessary to produce good acoustic and language models. Unfortunately, the problems above are not confined to automatic recognition; they also come into play when humans transcribe the data. Good examples of these problems can be found in the SWITCHBOARD resegmentation project FAQ.
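
To make the first point concrete, here is a minimal sketch of how disfluencies inflate the perplexity of a simple bigram language model. The toy training text, the add-one smoothing, and the example utterances are illustrative assumptions, not SWITCHBOARD data.

    import math
    from collections import Counter

    def bigram_perplexity(train_tokens, test_tokens, vocab_size, alpha=1.0):
        """Perplexity of an add-alpha smoothed bigram model."""
        bigrams = Counter(zip(train_tokens, train_tokens[1:]))
        unigrams = Counter(train_tokens)
        log_prob, n = 0.0, 0
        for prev, cur in zip(test_tokens, test_tokens[1:]):
            p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
            log_prob += math.log(p)
            n += 1
        return math.exp(-log_prob / n)

    # Toy "read speech" training text.
    train = "i am going to the store to buy some milk".split()
    vocab = set(train) | {"uh", "i'm", "gonna"}

    fluent    = "i am going to the store".split()
    disfluent = "i uh i'm gonna i am going to the store".split()

    print(bigram_perplexity(train, fluent, len(vocab)))     # low: every bigram was seen
    print(bigram_perplexity(train, disfluent, len(vocab)))  # higher: false start and fillers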
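The pronunciation problem shows up directly in the recognizer's lexicon: each conversational reduction adds another variant the decoder must consider. The ARPAbet-style entries below are illustrative only, not taken from any particular dictionary.

    # Hypothetical lexicon fragment with multiple pronunciation variants.
    lexicon = {
        "going to": ["g ow ih ng t uw",   # careful, read-speech form
                     "g ow ih n t ah",    # casual reduction
                     "g ah n ah"],        # heavily reduced ("gonna")
        "because":  ["b ih k ah z", "k ah z"],
        "probably": ["p r aa b ah b l iy", "p r aa b l iy", "p r aa l iy"],
    }

    variants = sum(len(prons) for prons in lexicon.values())
    print(f"{variants} pronunciations for {len(lexicon)} entries")
    # Every added variant enlarges the decoder's search space and increases
    # acoustic confusability between words.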
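The channel effect is also easy to simulate: band-pass wideband audio to roughly the 300-3400 Hz telephone band and downsample to 8 kHz, the standard telephone rate. This is a minimal sketch using SciPy; the input file name and exact band edges are assumptions.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfilt, resample_poly

    rate, audio = wavfile.read("wideband_16k.wav")  # hypothetical mono 16 kHz recording
    audio = audio.astype(np.float64)

    # 4th-order Butterworth band-pass approximating the telephone channel.
    sos = butter(4, [300, 3400], btype="bandpass", fs=rate, output="sos")
    narrowband = sosfilt(sos, audio)

    # Downsample 16 kHz -> 8 kHz and write out the degraded signal.
    narrowband_8k = resample_poly(narrowband, up=1, down=2)
    wavfile.write("narrowband_8k.wav", rate // 2, narrowband_8k.astype(np.int16))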