Initial versions of SWITCHBOARD were primarily used for developing
speaker recognition systems. A preliminary evaluation of LVCSR
systems using SWITCHBOARD was conducted in late 1994, followed by a
full evaluation in April 1995 in which the best system gave a WER of
43% with speaker adaptation and 50% without it. Since these initial
evaluations there has been gradual but considerable improvement in
the performance of LVCSR systems, as evidenced by the most recent
evaluations, held in Fall 1997. The table below summarizes the
results.
  SITE    | SWB-II WER | CALLHOME WER | OVERALL WER
  --------+------------+--------------+------------
  BBN     |    35.5    |     53.7     |    44.9
  BU      |    41.5    |     58.2     |    50.1
  CMU-ISL |    35.1    |     54.4     |    45.1
  CU-HTK  |    39.2    |     57.6     |    48.7
  DRAGON  |    39.9    |     57.4     |    48.9
  SRI     |    42.5    |     57.5     |    50.2

  Fall 1997 NIST Hub-5E Evaluation Results (all figures are word
  error rates, in percent)
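All of the figures above are word error rates (WER): the number of
substitutions, deletions, and insertions in the best alignment of a
hypothesis against the reference transcript, divided by the number of
reference words. Below is a minimal sketch of the standard
computation in Python; the example strings are invented.

    def wer(reference, hypothesis):
        # Word error rate: (substitutions + deletions + insertions)
        # divided by the number of reference words, computed with a
        # standard Levenshtein alignment over words.
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                         # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                         # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub,               # match/substitution
                              d[i - 1][j] + 1,   # deletion
                              d[i][j - 1] + 1)   # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution ("think" -> "thing") plus one deletion ("is")
    # in a five-word reference gives 40% WER.
    print(wer("i think that is right", "i thing that right"))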
Parallel to the work done as part of the evaluations, the summer
workshops held at CLSP, Johns Hopkins University have provided
significant insights into the problems posed by conversational
speech such as that found in SWITCHBOARD.
1996 LVCSR Workshop
- Speech Data Modeling: This work used an ANN-based hybrid system to
  experiment with multi-band and multi-scale input features. The main
  aim was to capture speaker and speech variations through signal
  processing and acoustic modeling. The best performance achieved was
  59% WER compared to a baseline of 64.9% WER. (A feature-extraction
  sketch follows this list.)
- Data-driven Pronunciation Modeling: This work was based on the
  premise that conversational speech contains more pronunciation
  variation than is captured by traditional phone-based lexicons. A
  decision-tree based technique was therefore developed to learn
  mappings from baseform phones to alternate pronunciations (see the
  decision-tree sketch after this list). The best performance
  achieved was 45.3% WER compared to a baseline of 46.4%.
- Hidden Speaking Mode Modeling: This work introduced a new
  conditioning variable, the mode, which reflected dynamic properties
  of the observed speech derived from both acoustics and text. The
  acoustic features included speaking rate, the presence and duration
  of silence, and counts of long and short pauses; the language model
  features included the discourse function of the utterance, the
  presence of disfluencies, and word frequencies (a mode-feature
  sketch follows this list). The best performance achieved was 53.7%
  WER compared to a baseline of 54.8% WER.
- Dependency Language Modeling: This work attempted to use linguistic
  structure to obtain better language models for conversational
  speech, pursuing dependency grammars and Maximum Entropy modeling
  (a minimal MaxEnt sketch follows this list). The best performance
  achieved was 45.4% WER compared to a baseline of 46.2%.
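The multi-band idea in the first item can be illustrated with a toy
front end: the magnitude spectrum of each frame is split into
sub-bands, and each band's log-energy stream would feed a separate
ANN whose posteriors are later merged. This is only a sketch under
assumed frame sizes and band layout, not the workshop's actual
configuration.

    import numpy as np

    def multiband_features(signal, frame=200, hop=80, bands=4):
        # Per-frame log energies in equal-width spectral sub-bands
        # (25 ms frames / 10 ms hop, assuming 8 kHz telephone speech).
        n_frames = 1 + (len(signal) - frame) // hop
        window = np.hamming(frame)
        feats = np.empty((n_frames, bands))
        for t in range(n_frames):
            chunk = signal[t * hop:t * hop + frame] * window
            spec = np.abs(np.fft.rfft(chunk))
            for b, band in enumerate(np.array_split(spec, bands)):
                feats[t, b] = np.log(np.sum(band ** 2) + 1e-10)
        return feats

    # One second of noise at 8 kHz -> a (98, 4) feature matrix.
    print(multiband_features(np.random.randn(8000)).shape)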
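The pronunciation-modeling item can be sketched with a small,
invented example: given phone-aligned pairs of baseform and surface
pronunciations, a decision tree learns context-dependent rewrite
rules. The alignments, phone symbols, and features below are
illustrative assumptions, not the workshop's data or exact feature
set.

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from sklearn.tree import DecisionTreeClassifier

    # (left context, baseform phone, right context) -> surface phone;
    # "-" marks a deletion and "#" a word boundary.
    data = [
        ("w",  "aa", "n", "aa"),
        ("aa", "n",  "t", "n"),
        ("n",  "t",  "#", "-"),   # word-final /t/ often deleted
        ("n",  "t",  "#", "t"),
        ("ih", "ng", "#", "n"),   # "-ing" -> "-in'"
        ("ih", "ng", "#", "ng"),
    ]
    phones = sorted({p for row in data for p in row})
    enc = LabelEncoder().fit(phones)

    X = np.array([enc.transform([l, b, r]) for l, b, r, _ in data])
    y = enc.transform([s for _, _, _, s in data])

    # Each leaf of the tree is a context-dependent pronunciation rule.
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # Predicted surface form of word-final /t/ after /n/:
    query = enc.transform(["n", "t", "#"]).reshape(1, -1)
    print(enc.inverse_transform(tree.predict(query)))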
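The acoustic side of the mode variable can be computed directly from
a time-aligned transcript, as in the sketch below; the pause
thresholds are assumptions, not the values used at the workshop.

    from dataclasses import dataclass

    @dataclass
    class Word:
        text: str
        start: float  # seconds
        end: float

    def mode_features(words, long_pause=0.5, short_pause=0.1):
        # Per-utterance mode features: speaking rate plus pause
        # statistics derived from inter-word gaps.
        speech = sum(w.end - w.start for w in words)
        gaps = [b.start - a.end for a, b in zip(words, words[1:])]
        return {
            "speaking_rate": len(words) / speech if speech else 0.0,
            "silence_total": sum(g for g in gaps if g > 0),
            "long_pauses": sum(g >= long_pause for g in gaps),
            "short_pauses": sum(short_pause <= g < long_pause
                                for g in gaps),
        }

    utt = [Word("i", 0.00, 0.10), Word("mean", 0.12, 0.40),
           Word("you", 1.10, 1.25), Word("know", 1.27, 1.50)]
    print(mode_features(utt))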
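Finally, since a Maximum Entropy model is in practice a multinomial
logistic regression over feature functions of the history, the
dependency language model can be caricatured in a few lines: each
prediction event carries both an n-gram feature (the previous word)
and a dependency feature (the syntactic head governing the predicted
position). The events and dependency links below are invented.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Each event: features of the history -> the word to predict.
    events = [
        ({"prev=the", "head=ate"},    "dog"),
        ({"prev=a",   "head=ate"},    "sandwich"),
        ({"prev=the", "head=chased"}, "cat"),
        ({"prev=a",   "head=chased"}, "ball"),
    ]
    vec = DictVectorizer()
    X = vec.fit_transform([{f: 1 for f in fs} for fs, _ in events])
    y = [w for _, w in events]

    # L2-regularized MaxEnt model; the learned weights play the role
    # of the feature lambdas.
    lm = LogisticRegression(max_iter=1000).fit(X, y)

    # P(word | prev=the, head=ate): the dependency feature moves
    # probability mass toward "dog" even though the bigram context
    # alone is ambiguous between "dog" and "cat".
    q = vec.transform([{"prev=the": 1, "head=ate": 1}])
    for w, p in zip(lm.classes_, lm.predict_proba(q)[0]):
        print(w, round(p, 2))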
1997 LVCSR Workshop
This workshop focused on all aspects of conversational speech recognition.
The primary areas of research were multi-scale acoustic modeling,
discriminant analysis, syllable-based speech processing, pronunciation
modeling, and integration of discourse-level information into the recognition
process.
A comprehensive summary of the work done during this workshop is
available.