Why is conversational speech recognition much more difficult than read
speech (such as Broadcast News, Wall Street Journal, etc)? Simply put,
there are many more factors that come into play when humans converse
naturally than when a person is reading prepared text. Added to this
is the effect of telephone bandwidth and line noise. To demonstrate
this, we have prepared a set of examples illustrating the varying difficulty levels of speech recognition tasks.
The main factors contributing to the difficulty of SWITCHBOARD
recognition are:
- Conversational Speaking Style: The language model is affected by disfluencies, by syntactic and discourse patterns, and by the linguistic incoherence introduced by automatic segmentation. False starts, interruptions, and bad grammar are particularly difficult. (A toy perplexity sketch after this list shows the effect of disfluencies on the language model.)
- Pronunciation Effects: The speaking-rate variability and reduced pronunciations of conversational speech are difficult enough for human listeners, and they only compound the already difficult task of automatic speech recognition. A key issue is the poor correspondence between pauses and phrase boundaries. (A lexicon sketch after this list shows how reduced forms multiply pronunciation variants.)
- Telephone Bandwidth and Ambient Noise: The limited bandwidth of the telephone channel degrades the original speech signal, making acoustic discrimination a challenge. Crosstalk, background speech, and ambient noise such as music and television all increase the error rate of automatic recognition. Remarkably, humans do a very good job of filtering out these types of noise, while our current statistical modeling techniques fall far short. (A band-limiting sketch after this list simulates the channel degradation.)
- Training Data: A wealth of accurately transcribed training data is necessary to produce good acoustic and language models. Unfortunately, the problems above are not confined to automatic recognition; they also come into play when humans transcribe the data. Good examples of these problems can be found in the SWITCHBOARD resegmentation project FAQ.
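
To make the first point concrete, here is a minimal sketch of how disfluencies inflate the perplexity of a simple bigram language model. The toy training text, the add-one smoothing, and the example utterances are illustrative assumptions, not SWITCHBOARD data.

    import math
    from collections import Counter

    def bigram_perplexity(train_tokens, test_tokens, vocab_size, alpha=1.0):
        """Perplexity of an add-alpha smoothed bigram model."""
        bigrams = Counter(zip(train_tokens, train_tokens[1:]))
        unigrams = Counter(train_tokens)
        log_prob, n = 0.0, 0
        for prev, cur in zip(test_tokens, test_tokens[1:]):
            p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
            log_prob += math.log(p)
            n += 1
        return math.exp(-log_prob / n)

    # Toy "read speech" training text.
    train = "i am going to the store to buy some milk".split()
    vocab = set(train) | {"uh", "i'm", "gonna"}

    fluent    = "i am going to the store".split()
    disfluent = "i uh i'm gonna i am going to the store".split()

    print(bigram_perplexity(train, fluent, len(vocab)))     # low: every bigram was seen
    print(bigram_perplexity(train, disfluent, len(vocab)))  # higher: false start and fillers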
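The pronunciation problem shows up directly in the recognizer's lexicon: each conversational reduction adds another variant the decoder must consider. The ARPAbet-style entries below are illustrative only, not taken from any particular dictionary.

    # Hypothetical lexicon fragment with multiple pronunciation variants.
    lexicon = {
        "going to": ["g ow ih ng t uw",   # careful, read-speech form
                     "g ow ih n t ah",    # casual reduction
                     "g ah n ah"],        # heavily reduced ("gonna")
        "because":  ["b ih k ah z", "k ah z"],
        "probably": ["p r aa b ah b l iy", "p r aa b l iy", "p r aa l iy"],
    }

    variants = sum(len(prons) for prons in lexicon.values())
    print(f"{variants} pronunciations for {len(lexicon)} entries")
    # Every added variant enlarges the decoder's search space and increases
    # acoustic confusability between words.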
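The channel effect is also easy to simulate: band-pass wideband audio to roughly the 300-3400 Hz telephone band and downsample to 8 kHz, the standard telephone rate. This is a minimal sketch using SciPy; the input file name and exact band edges are assumptions.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfilt, resample_poly

    rate, audio = wavfile.read("wideband_16k.wav")  # hypothetical mono 16 kHz recording
    audio = audio.astype(np.float64)

    # 4th-order Butterworth band-pass approximating the telephone channel.
    sos = butter(4, [300, 3400], btype="bandpass", fs=rate, output="sos")
    narrowband = sosfilt(sos, audio)

    # Downsample 16 kHz -> 8 kHz and write out the degraded signal.
    narrowband_8k = resample_poly(narrowband, up=1, down=2)
    wavfile.write("narrowband_8k.wav", rate // 2, narrowband_8k.astype(np.int16))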