ADVANCES IN SPEECH RECOGNITION TECHNOLOGY THROUGH CORPUS DEVELOPMENT

By Jonathan E. Hamaker

A Thesis Submitted to the Faculty of Mississippi State University in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering in the Department of Electrical and Computer Engineering

Mississippi State, Mississippi
May 1999

Copyright by Jonathan E. Hamaker 1999

By Jonathan E. Hamaker

Approved:

Joseph Picone, Associate Professor of Electrical and Computer Engineering (Director of Thesis)
James C. Harden, Graduate Coordinator of Computer Engineering in the Department of Electrical and Computer Engineering
G. Marshall Molen, Department Head of the Department of Electrical and Computer Engineering
A. Wayne Bennett, Dean of the College of Engineering
James E. Fowler, Assistant Professor of Electrical and Computer Engineering (Committee Member)
Lois C. Boggess, Professor of Computer Science (Committee Member)
Richard D. Koshel, Dean of the Graduate School

Name: Jonathan E. Hamaker
Date of Degree: May 13, 1999
Institution: Mississippi State University
Major Field: Computer Engineering
Major Professor: Dr. Joseph Picone
Title of Study: ADVANCES IN SPEECH RECOGNITION TECHNOLOGY THROUGH CORPUS DEVELOPMENT
Pages in Study: 36
Candidate for Degree of Master of Science

Your abstract here. No more than 150 words for a Master's Thesis and no more than 350 words for a Ph.D. Dissertation

DEDICATION

I would like to dedicate this to Homer Simpson.

ACKNOWLEDGMENTS

The author is forever indebted to the King of Swing, Benny Goodman.

CHAPTER I
INTRODUCTION

Progress on speech recognition technology has been impressive. There are now commercial products that allow automatic dictation, telephone voice interfaces, and voice-activated appliances. Yet the fluent, unconstrained conversational interfaces long promised by science fiction remain beyond the reach of current systems. Recognizing true conversational speech is still an open problem, and progress on it depends as much on the quality and quantity of the data used to train and evaluate recognition systems as on advances in the underlying algorithms.

The SWITCHBOARD Task

Poised at the center of this expanding research is the SWITCHBOARD (SWB) Corpus. Collected by Texas Instruments under Department of Defense sponsorship in the early 1990s, SWB comprises over 240 hours of two-channel conversational telephone speech from more than 500 speakers, and it has since become a primary training and evaluation corpus for large vocabulary conversational speech recognition research.

Objective of the Study

Much work in recent years has been devoted to building robust conversational speech recognition systems, and at the center of that work have been corpora such as the SWITCHBOARD Corpus. Error rates on this task decreased quickly at first, but in recent years performance has stagnated. A great deal of effort has been spent finding workarounds for the deficiencies inherent in the SWITCHBOARD transcriptions and segmentations, and many conclusions have been drawn from the faulty data. This thesis endeavors to eliminate the need for such workarounds by building a sound set of segmentations and transcriptions. In particular, we attempt to quantify the recognition performance improvement that can be gained by starting with a clean set of transcriptions.
We also attempt to determine the extent to which previous work with the SWB Corpus was affected by the poor quality of the transcriptions and segmentations.

Organization of Thesis

This thesis, as described above, builds upon the body of work dealing with the SWITCHBOARD Corpus toward the ultimate goal of providing insight into the speech modeling process. Of particular concern are the effects of the training data on the overall ability to build robust speech recognition systems. The main body of the thesis is divided roughly into two parts, the second building on the first. The first part of the thesis presents an overview of speech corpora as they pertain to state-of-the-art speech recognition research. Chapter II gives a brief history of speech recognition research to provide the reader with the motivation for the topics considered in this thesis. Chapter III details the segmentation and transcription strategies applied in this work and gives anecdotal evidence of their utility. This chapter also includes a synopsis of previous segmentation and transcription strategies, listing their respective advantages and shortcomings. The second part of the thesis provides experimental evidence of the validity of the approach investigated. Chapter IV presents an analysis of the revised corpus, including properties of the new utterance definitions that bear on language modeling experiments which, to date, had been tailored around faulty transcriptions and inadequate segmentations. Chapter V examines the strength of the new segmentation scheme in terms of a standard recognition evaluation. Suggestions for extending this work are presented in Chapter VI, and conclusions are drawn in Chapter VII.

CHAPTER II
HISTORICAL PERSPECTIVE

The motivation for this work can be traced as far back as the 1950s with the implementation of phonetic typewriters [1, 2]. Though these devices fell far short of their expectation of solving the speech problem, the framework we use today can be linked to their efforts. Their goal was to "use acoustic features of speech and a knowledge of phonetics to turn flowing speech into phonetic representations" [3]. In a broad sense, this can still be seen as the goal of speech recognition scientists throughout the world.

Based on this ground-breaking work, researchers at Bell Labs, RCA, and MIT used the 1960s as a testing ground for more expansive and complex systems. Systems using dynamic time warping and pattern matching were introduced which performed well on small isolated-word tasks [4]. However, they were still unable to make good progress toward large vocabulary continuous speech recognition (LVCSR) systems.

In the 1960s, 70s, and 80s, impressive strides were made in the field which pushed it in a new direction. Groups at Texas Instruments (TI), IBM, and others began releasing commercial speech recognition products. In the 1960s and 70s these products were able to solve what is now the somewhat trivial task of recognizing discrete utterances in a relatively noise-free environment. These were somewhat specialized and had limited vocabularies, but by the late 1980s systems existed which were capable of recognizing a vocabulary of 20,000 words spoken in isolation. Speech research in this time was characterized by a movement from template-based technology to statistical modeling approaches. These advances were owed, in part, to the advances in digital computing over the years: scientists were able to develop more complex, more robust, and yet less expensive systems.
Thus they were able to attempt recognition strategies which would have been impossible with the older analog and early digital systems.

In addition to advances in computing and science, funding agencies have played a large role in speech research. In the late 1980s the Defense Advanced Research Projects Agency (DARPA) established programs to give the industry a push toward continuous speech and large vocabulary applications. These efforts and DARPA's support have extended into the 1990s and remain a driving force today. In recent years, DARPA has supported many large programs such as the Wall Street Journal [5] and HUB [6] tasks and the SWITCHBOARD [7] corpus and its evaluations. It is the latter of these tasks that this thesis deals with in detail.

The SWITCHBOARD Task

In the early 1990s, the Department of Defense (DoD) and DARPA saw the need for a large amount of data from a variety of speakers to be used for a number of speech research needs including speech recognition, speaker recognition, and topic spotting. Previous common evaluation tasks, such as the Resource Management (RM) [8] and Air Travel Information System (ATIS) [9] tasks, had been narrow in scope and covered only a few speakers. Texas Instruments was sponsored by DoD in 1990 [10] to collect the SWITCHBOARD (SWB) Corpus. In 1993, the first LDC release of the corpus occurred. In addition to the audio data, this release included transcriptions segmented at conversation turn boundaries and time alignments for each word based on a phone-level supervised recognition.

SWB was a great example of the trials and tribulations of database work, in that the quality of the data suffered from a lack of understanding of the problem. Word-level transcription of SWB is difficult, and conventions associated with such transcriptions are highly controversial and often application dependent. The data was subsequently used for many types of research for which it was never originally intended. Hence, by 1998, the quality of the SWB transcriptions for LVCSR was recognized to be less than ideal, and many years of small projects attempting to correct the transcriptions had taken their toll. Numerous versions of the SWB Corpus were in circulation; few of these improved transcriptions were folded back into the LDC release; and many sites had spent considerable research time cleaning up a portion of the data in isolation.

The SWB Data Collection Paradigm

SWB was the first database of its type: two-way conversations collected digitally from the telephone network using a T1 line. In retrospect, a number of issues in this type of data collection have surfaced - most notably a problem involving echo cancellation. In the original SWB data collection, echo cancellation was not always activated because the phone calls were bridged within the SWB data collection platform and hence appeared as local calls to the network. This resulted in a significant portion of the data having serious echo. As described later, in this work echo cancellation was used during transcription to counteract this problem.

There are also a variety of real-time problems evident in the SWB corpus. For example, some conversations experience a loss of time synchronization between channels of the data. This causes serious problems for the echo canceller, which assumes a fixed or extremely slowly varying delay between the source signal on one channel and its echoed version on the other channel [11, 12]. Sometimes the echo appears before the source signal - clearly indicating a loss of data somewhere.
Similarly, occasionally data appears to be lost without any corresponding error reports, causing unnatural chops in the audio files on one or both channels. Sometimes the missing data is filled with a run of zero-amplitude values. In a related problem, data has been observed that is "out of order" (the latter part of a word comes before the first part of the word), signaling that perhaps buffers were swapped or overwritten during collection. Finally, some conversations suffer from the introduction of digital noise due to out-of-band signaling. Many of these problems are summarized in an FAQ [13] developed during this study which is maintained as a service to the speech research community.

A Historical Perspective on the SWITCHBOARD Transcription Problem

SWB, in its entirety, consists of 2438 conversations totaling over 240 hours of two-channel data from 541 unique speakers. The average duration of a conversation is six minutes, as shown in Figure 1. Of the speakers present in the corpus, 50 contributed at least one hour of data each. A distribution of the amount of data from each speaker is shown in Figure 2. The first half of the database was transcribed by court reporters; the second half by hourly workers employed by TI. Since SWB was one of the first conversational speech corpora of its type, conventions for transcription were extremely controversial, and there was not much of an inventory of prior art [10]. The two main goals of the transcription conventions were consistency and utility in speech and linguistic research. Human readability was also important because it aided in the quality control steps taken after transcriptions were complete. It was decided that conversations would be broken at turn boundaries (points at which the active speaker changed) and that a simple flat ASCII representation for the orthography would be used. Quality control steps included spell-checking the transcriptions, checking for misidentification of speakers, and looking for common language or spelling errors (its, it's, they're, their, there, etc.). After the transcriptions and quality control steps were complete, time alignments were generated which estimated the beginning time and duration of each word. Finally, a rough check of the time alignments was made by playing samples of each conversation at several places throughout the speech file; errors of over one second usually resulted in reprocessing the data [14].

Segmentation and Its Impact on Technology Development

Initial LVCSR systems had high recognition error rates on SWB - approximately 70% in the early and mid-1990s. The sources of this degraded performance include the lack of a robust language model (which had proven to be effective on the Wall Street Journal task) and poorly calibrated acoustic models (i.e., there is a significant mismatch between the training and test databases when one examines acoustic scores). The difficulties in recognition arise from short words, telephone channel degradation, and disfluent and coarticulated speech. In an effort to reduce error rates, many state-of-the-art systems introduced dynamic pronunciation models [15] and a flexible supervised training procedure [16]. Over the years, word error rates (WER) on various subsets of SWB have fallen to the mid-20% range [17] and to the low-30% range on standard evaluations. However, as performance improvements become less dramatic and most of the obvious obstacles to performance are overcome, the quality of the training database soon becomes an issue.
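For reference, the WER figures quoted here and throughout this thesis follow the standard definition: the recognizer output is aligned against a reference transcription, the substitutions (S), deletions (D), and insertions (I) are counted, and the total is normalized by the number of reference words (N):

    WER = (S + D + I) / N x 100%.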
A casual review of the SWB Corpus as processed by most sites quickly reveals that much of the data is discarded due to the unreliable transcriptions. Pilot studies at the 1997 Research Workshop on Innovative Techniques for LVCSR (WS'97) made it evident that improving the quality of the database through resegmentation and transcription corrections could greatly improve the resultant acoustic models being used for LVCSR experiments. Simply resegmenting the test database resulted in a 2% reduction in WER [18].

CHAPTER III
METHODOLOGY

In the past, speech segmentation was guided by linguistic or acoustic criteria. To linguistically segment data, one places boundaries at natural breaks in speech (between phrases, sentences, turns, etc.). In acoustic segmentation, boundaries are placed in acoustic silence between words. Though both are commonly used, each of these methods has its drawbacks. Both historically have resulted in utterance definitions that truncate words at the beginning or end of the resulting speech file.

Linguistic segmentation is effective in maintaining clear linguistic context, but it has two important problems. First, if the boundaries are based solely on language rules and not on acoustics, boundaries may be placed between words where there is little or no silence. This will result in word beginnings and ends being cut off, which will adversely affect training of acoustic models. Second, linguistically based boundaries often result in utterances which are too long for experimental recognition systems. Speakers in SWB sometimes carry on monologues of the same thought for 30-60 seconds, but the ideal utterance length for experimentation is closer to 10 seconds (note that common evaluations have often used much shorter utterance definitions).

Segmenting speech based solely on acoustic boundaries also has its advantages. It is a more desirable paradigm in that boundaries are only placed where there is a significant pause in speech. This is a necessity in terms of training acoustic phonetic units for recognition systems. However, this method obscures any inherent linguistic context. Thus, it is of no use when training language models.

It is clear, then, that the key is to strike a balance between these competing paradigms: manually placing boundaries where there is acoustic silence, maintaining linguistic context, and regulating the length of the utterances. A large portion of this work examines the utility of segmentations at boundaries that represent a compromise between the need for strong linguistic and acoustic models of conversational speech. The need for that compromise and the means to achieve it are detailed in this chapter. The net result is an utterance definition with ample amounts of silence at the beginning and end of the file, but which contains, at the very least, a linguistically meaningful unit. All data is accounted for in these segmentations, so utterance definitions involving larger linguistic units can easily be built from them. Figure 3 gives an overview of the data production process applied in this work.

Echo Cancellation

Echo cancellation is simple and consists of passing the speech data through a standard least mean-square error echo canceller [12] currently in use in the DoD speech community. Removal of echo is a very important step which makes the job of transcription much easier.
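To make the processing concrete, the sketch below shows a minimal normalized LMS (NLMS) echo canceller in Python. It is an illustration only, not the canceller of [12]: the filter length, step size, and function name are assumptions made for this example, and a production canceller would also need to track the bulk delay between the two channels.

import numpy as np

def nlms_echo_cancel(near, far, num_taps=128, mu=0.5, eps=1e-8):
    # near: samples from the channel being cleaned (speech plus echo of the other side)
    # far:  samples from the other channel (the source of the echo)
    w = np.zeros(num_taps)                  # adaptive FIR estimate of the echo path
    buf = np.zeros(num_taps)                # most recent far-end samples
    out = np.zeros(len(near))
    for n in range(len(near)):
        buf = np.roll(buf, 1)               # shift the far-end delay line
        buf[0] = far[n]
        echo_estimate = np.dot(w, buf)      # current estimate of the echoed signal
        error = near[n] - echo_estimate     # echo-suppressed output sample
        w += (mu / (eps + np.dot(buf, buf))) * error * buf   # NLMS weight update
        out[n] = error
    return out

Run over each side of a conversation with the opposite side as the far-end reference, a filter of this kind suppresses the bridged echo that, as described below, frequently confused transcribers working on the original data.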
Echo is a major source of trouble on SWB - transcribers will commonly make wrong channel assignments when operating on the original data because the amplitude of the echoed speech is often on par with that of the speech data from that channel. Of course, mistranscribed data of this type wreaks havoc on the training of acoustic models and language models. Providing transcribers with echo-cancelled data should eliminate the cause of many of the swapped-channel problems that have plagued SWB in the past.

Resegmentation

The resegmentation of the SWB data is expected to provide the most substantial results in this work, as the ability to train recognition models is closely tied to the structure of the input data. At the 1997 Speech Recognition Workshop, similar resegmentation work on the test database resulted in a 2% reduction in WER [18]. Resegmentation is a challenging part of the correction process because a decision must be made on whether to split at natural linguistic boundaries (sentence boundaries, turn boundaries, phrase boundaries, etc.) or to split at acoustic boundaries where there is a pause in the speech. The strategy used in this work is as follows:

· Segment at locations where there is clear silence separating each segment.
· Segment along phrase, sentence, and/or train-of-thought boundaries.

The first rule is important because it eliminates the problem of truncated words due to segment boundaries falling where there was not enough separation between words. Truncation has a negative effect on training of acoustic models since it diminishes one's ability to accurately model coarticulation effects, and it may attribute acoustics to the incorrect word of the coarticulation pair, thus training the model with out-of-class data. The second rule is implemented to maintain linguistic context and clarity for speech understanding and language modeling experimentation. These general guidelines were refined to produce the short set of specific guidelines shown below:

· Each utterance should be padded by a nominal 0.5 second buffer of silence on both sides. In general, these silence buffers can range from 0.35 to 0.75 seconds. This provides ample silence at the start and end of utterances to negate the possibility of acoustic information being truncated.
· The boundary can only be placed in a "silence" consisting solely of channel noise and background noise. Whenever possible, place the boundary in a section with very low energy (visually, this is a flat part of the signal). It is the intention of this work to have "clean" utterances where each boundary is in a point of silence, each utterance is buffered by silence, and each utterance contains a meaningful phrase. Boundaries in noisy locations cause corruption of delta features, leading to less accurate acoustic models.
· The 0.5 second buffers can contain breath noises, lip smacks, channel pops, and any other non-speech phenomena. However, the boundary cannot be placed within a noise of this sort.
· No utterance can be longer than 15 seconds. As an utterance approaches 15 seconds in length, the validator is allowed to find a point of segmentation that will generate silence buffers less than 0.5 seconds but not less than 0.1 seconds. This rule ensures that the data generated is suitable for use in speech recognition systems. Utterances longer than this can produce a search space which extends beyond the capability of common computers to handle efficiently.
· Every utterance containing only silence must be greater than 1.0 second in duration. Otherwise the silence region could be used as part of the buffer for the previous and next utterances.
· Whenever possible, choose a segmentation that maintains the phrase structure of the conversation. This means that the ideal utterance would contain a single phrase. However, due to the nature of the SWB data, this is not always possible. This rule is extremely important in terms of language modeling. Past segmentations placed boundaries in locations which were not conducive to studying the true communication process in continuous conversational speech.
· The end of the preceding utterance coincides with the start of the next utterance. Hence all data is accounted for. Segmentation essentially involves placing a boundary between two utterances.
· Consider a stretch of silence which has small-amplitude noises embedded in it as a silence-only utterance - do not mark the noise and do not segment the noises into separate utterances. However, if a noise has a particularly high amplitude, then segment it into its own utterance. Speech recognition systems need to be robust to low-level noise in the channel and need to learn how to distinguish between noise and speech. These sorts of utterances give researchers examples of noisy data.

Transcription Correction

Though segmentation is the primary focus of this work, the SWB transcriptions are also a source of problems in building recognition systems. A detailed review of a small section of the original transcriptions revealed that, on average, 8% of the words transcribed are in error [19]. It is difficult enough training models for conversational speech when the training data is pristine, but in this case models were being trained on out-of-class data. Thus, it was necessary in this preliminary stage to also revise the transcriptions.

A highly detailed list of transcription rules was created for validators to use when handling partial words, mispronunciations, and proper nouns. These rules originated from the LDC transcription conventions [14] released with the SWB Corpus. Significant changes were made to the original LDC transcription conventions to ensure the highest level of accuracy and consistency in the transcriptions. A complete description of the modified transcription conventions [20] is maintained on the web site developed for this work. Most of the conventions described in this document were also discussed with experts in linguistics and speech technology on a mailing list maintained as part of this work.

Many of the transcription rules were a by-product of problems pointed out by validators. Each time a validator was not able to easily arrive at a transcription by following the conventions, a rule was added to maintain clarity and consistency. In such cases, the speech research community was solicited for input to arrive at a consensus. Listed below are a few of the more interesting and difficult issues that were encountered:

· Title capitalizations: Speakers often refer to titles in their conversations. There is a debate as to how to capitalize these proper nouns. The question was whether each word in the title should be capitalized (example: "Gone With The Wind") or whether standard grammatical rules should be followed by capitalizing the first word and last word and keeping prepositions of fewer than five letters lower case (example: "Gone with the Wind"). The latter convention was followed in this work.
· Compound words: It was difficult for validators to be consistent with the transcription of compound words (example: "everyday" vs. "every day"). To avoid inconsistencies throughout the transcriptions, all compound words were transcribed as one word regardless of context unless there was a definite acoustic pause between the two words.
· Coinages: Speakers often use, and attribute meaning to, words that do not occur in the dictionary (example: the person who sells the gun ought to protect themself). In this example, "themself" is not a proper word, but the speaker is using it as if it were. The convention for these words, called coinages, is to transcribe the word in braces - in this case, "{themself}".
· Mispronunciations: Occasionally speakers mispronounce a word or say a word they did not mean and then correct themselves (example: I blame the splace space program). Here the caller accidentally said "splace" and then corrected the mistake by saying "space". The transcription in such cases consists of the phonetic spelling of the misspoken word and the intended word, separated by a slash and enclosed in brackets. The example is transcribed as "I blame the [splace/space] space program".
· Vocalized noise: Several examples have been cited of a speaker making a sound that cannot be deciphered as a word or partial word and also cannot be classified as coughing, breathing, or any of the other usual non-speech noises (example: she was able to pull out of it uh d- w- so cheaply the second time). This speaker uses the "d- w-" as a hesitation sound. Such cases are transcribed with the tag [vocalized-noise].
· Partial words: Speakers commonly start, but do not finish, the acoustics of a word; this is known as a false start (example: if the speaker began the word "space" but only said "spa-"). The convention for these cases is to transcribe the part of the word that was said and to enclose the unspoken remainder in brackets, followed or preceded by a dash. This provides the full word context for language modeling applications. In this example, "spa[ce]-" would be correct.
· Laughter words: The original LDC transcription conventions transcribed laughter alone, but there was no convention for transcribing a person speaking while simultaneously laughing, which occurs quite often in conversational speech. This is annotated in the new transcriptions by transcribing laughter and the spoken word separated by a hyphen and enclosed in brackets. An example is "[laughter-yes]".
· Asides: A situation that occurs relatively infrequently in SWB is when one of the two speakers in the conversation talks to a person in the background. In the past, this may have been transcribed as [noise], as part of the normal transcription, or, worse, not transcribed at all. This could have dire consequences for training or testing a system since the acoustics for these "asides" are at the same level as the conversational acoustics. Also, these asides will often carry over into the conversation between the two primary speakers. The practice was adopted of transcribing the parts of the conversation spoken as asides between an opening and a closing aside markup (example: " excuse me what's the matter sweetie you need to wash your hands maybe Paw-Paw can help you sure sorry").

These and many other transcription issues can be found in the FAQ maintained as part of this work [13].
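Conventions of this kind lend themselves to automatic checking. The sketch below is a minimal illustration of how a few of the markups described above could be recognized and how malformed markup could be flagged; the patterns and function names are assumptions made for this example and are not the quality control code actually used in this work.

import re

# Illustrative patterns for a few of the transcription conventions described above.
CONVENTION_PATTERNS = {
    "coinage":          re.compile(r"^\{[a-z']+\}$"),           # e.g. {themself}
    "laughter_word":    re.compile(r"^\[laughter-[a-z']+\]$"),   # e.g. [laughter-yes]
    "mispronunciation": re.compile(r"^\[[a-z']+/[a-z']+\]$"),    # e.g. [splace/space]
    "partial_word":     re.compile(r"^-?[a-z']*\[[a-z']+\][a-z']*-?$"),  # e.g. spa[ce]-
    "vocalized_noise":  re.compile(r"^\[vocalized-noise\]$"),
}

def classify_token(token):
    # Return the convention a token matches, or None for an ordinary word.
    for name, pattern in CONVENTION_PATTERNS.items():
        if pattern.match(token):
            return name
    return None

def flag_malformed_markup(tokens):
    # Flag tokens that use brackets or braces but match none of the known conventions.
    return [tok for tok in tokens
            if any(c in tok for c in "[]{}") and classify_token(tok) is None]

# Example: the last token is missing its closing brace and would be flagged.
print(flag_malformed_markup("i blame the [splace/space] space program {themself".split()))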
The biggest challenge in transcribing SWB is the transcription of words that are mumbled, distorted, or spoken too quickly by the caller. Even after listening to the words dozens of times and drawing from as much context as possible, there are still times where the validator must make what amounts to an educated guess. These problems account for most of the final word errors in the revised data. It could certainly be debated that such words are of no use for training acoustic models and, in fact, may be a detriment to the models. However, it was the practice in this work to transcribe all speech in the database with the most likely word given all of the available information.

An Overview of the Segmentation Tool

The segmentation tool [21] developed as part of this work is a graphical, point-and-click interface designed to expedite the segmentation and transcription process. The tool is written entirely in C/C++ interfaced to Tcl/Tk and is designed to be highly portable across platforms. It greatly streamlined the segmentation and transcription process. Its most fundamental design feature is that all speech data must be accounted for. Silence regions are explicitly marked; no audio data is ignored in the transcription process. A screenshot of this tool is shown in Figure 4.

This tool has a short and easy learning curve that results in a short training period for a validator, while providing a powerful interface for efficiently altering the utterance boundaries and transcriptions. The display area of the tool provides the validator with instant access to the acoustic waveforms and the audio context for any utterance, as well as the functionality to zoom in and/or play a selected portion of the utterance. An additional word-alignment mode allows a validator to check the transcription accuracy word-by-word at high speed, thus providing an efficient means of maintaining strict quality control.

The audio tools embedded in the segmentation tool are an important part of its design. Each channel of the two-channel signal (often mistakenly referred to as a stereo signal) can be reviewed independently, or both channels can be heard simultaneously. Two-channel audio is an integral part of the SWB task, since it allows the transcribers to probe each side of the conversation separately or listen to the full context. Merging or splitting utterances is as simple as clicking a button. There are features to delete or clear the transcriptions of the current utterance or to insert a new, blank utterance. Transcriptions are easily modified, and convenient key strokes make it easy to move between utterances.

Data Verification

The goal of revising the corpus is to produce segmentations and transcriptions of the highest possible quality. As with any process involving manual validation, this process is subject to occasional error. To combat this problem, an extensive set of data verification and quality control procedures was instituted. An overview of this process is shown in Figure 5. This process typically results in the review of all of the data via the quality control utilities and of 10%-20% of the data in detail. In this stage, problem utterances are marked by the quality control scripts and each of these is reviewed and corrected if necessary. At this point any questions logged by the validators regarding segmentation and transcription are reviewed and decided upon via interaction with the research community.
Beyond this set of internal reviews, an incremental release of the data was made to the public domain for review by speech technology experts, who played a part in the quality control process via their feedback.

At the core of the quality control regimen is a set of utilities that automatically tag utterances that have common errors such as misspellings and boundaries in noise. The sequence of scripts used is shown in Figure 6. Notice that the process is iterative, as each marked problem must be adjudicated before the conversation is released. Each of these checks is described in detail below.

· check_bounds: This utility finds all gross errors in the boundary alignment. The utility verifies that every sample of data in the speech file is accounted for by the transcription start and end times. It does so by making sure that the start time of every utterance (or word, in the case of word alignment files) is equal to the end time of the previous utterance or word. It also checks that the end time of the last utterance or word is equal to the last sample in the file and that the start time of the first utterance is zero.
· check_silence: One of the segmentation conventions is that every utterance marked as containing only silence should be at least 1.0 second long. At times the validators intend to merge a pair of utterances but unintentionally leave a dangling silence-only utterance which is extremely small. This utility finds these problems by tagging all utterances that are transcribed as "[silence]" but are shorter than a specified minimum duration.
· utterance_hist: To maintain efficient recognition performance, it is believed that the average SWB utterance should be between 8 and 10 seconds long and should rarely be greater than 15 seconds or less than 2 seconds. The utterance_hist utility accepts a list of transcription files and flags utterances in those files whose duration falls outside of the accepted range (2-15 seconds). It also produces comprehensive statistics for that list of files including: number of conversations processed; number of non-silence and silence-only utterances; number of words; hours of non-silence and silence-only data in the conversations; mean duration of non-silence utterances; standard deviation of duration among non-silence utterances; and maximum and minimum utterance lengths. These statistics are used to characterize the data produced and to search for any trends in the data which would lead to problems in the revisions.
· check_speech_rate: Most gross errors in transcriptions (such as accidentally replicating part of the transcription twice in one utterance) can be easily found by examining the speech rate of each utterance. This is a measure of the number of words transcribed per second of speech in the utterance. The majority of correct utterances have rates between 0.5 and 5.0 words per second. Thus, this script flags any utterances which have speech rates outside of this range. There are, of course, utterances which are in error yet still fall within the range of accepted rates. The number of these is minimal in the final data, and they are usually found in the word alignment stage.
· check_energy: A speech recognition system typically assumes that the start of the utterance is in a state of silence. This allows the system to assume initial conditions for computation of differential features, etc. For this reason, boundaries which occur in regions of high energy (e.g., noise bursts, laughter, etc.) are detrimental.
This utility is the primary means for verifying that the boundaries are being placed in low-energy areas. A standard algorithm is used to determine the nominal channel energy level. For each utterance in a conversation, check_energy finds the average energy of a window around the boundary. If that average energy is larger than the noise floor of the conversation by a certain amount (typically 25 dB), then the boundary is flagged as occurring in an impulsive noise. This method has been extremely successful in finding boundaries placed in noise or echo.
· check_dictionary: As part of this project a dictionary is maintained which contains all of the words that are allowable in transcriptions. This dictionary provides a pronunciation for each word in the conversations. With each corrected transcription, words that are not currently in the dictionary are found - these are usually partial words, proper names, or laughter words. check_dictionary is used to find those words that are in the transcriptions but not in the dictionary. Each of these is individually reviewed and, if the word is correct in the transcription, added to the dictionary. This provides a pointer to any misspelled or misused words in the transcriptions. This utility is not foolproof, since a word can be mistranscribed yet still appear in the dictionary. An example of this is a transcription of "World War I" which should be transcribed as "World War One" by convention; but since "I" is in the dictionary, check_dictionary will allow this phrase to pass. Errors like this are either caught through the other quality control scripts or when manual word alignments are performed.

Cross-Validation

Cross-validation is central to evaluating the performance of the validators as well as the quality of the transcriptions. In these tests, a number of validators segment and/or transcribe the same conversation. These transcriptions are compared for accuracy and consistency. Each validator's transcriptions are checked against a reference determined upon careful review by a set of experienced researchers. This is a blind test, so the validators are unaware that they will be scored on that particular conversation. The transcriptions of each validator and the original LDC transcriptions are compared to the reference to provide an estimate of the improvement in SWB transcriptions after resegmentation. The results of a recent cross-validation experiment are detailed in Table 1. This test was a transcription cross-validation in which the validators transcribed data from the same segmentation of conversation sw2137 and were scored against a reference that was also transcribed from that segmentation. The errors shown in the table are significant errors, counting only deletions, insertions, or substitutions of words. They specifically do not include minor differences in partial words, differences in transcription conventions (when scoring the LDC data), or the marking of noises. One can see from the table that the revised transcriptions better the LDC transcriptions by a significant margin.
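The comparisons in these cross-validation tests amount to standard word error rate scoring: each transcription is aligned against the reference by dynamic programming, and the substitutions, deletions, and insertions are counted. The sketch below is a minimal illustration of such scoring; it is an assumed example rather than the scoring tool used in this work, and real scoring would first normalize markups such as partial words and noise tags.

def word_error_rate(reference, hypothesis):
    # Minimum number of substitutions, deletions, and insertions needed to turn the
    # hypothesis word sequence into the reference, as a percentage of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i                                   # i deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = cost[i - 1][j] + 1
            insertion = cost[i][j - 1] + 1
            cost[i][j] = min(substitution, deletion, insertion)
    return 100.0 * cost[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one deletion against a five-word reference gives 40% WER.
print(word_error_rate("i blame the space program", "i blame that program"))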
CHAPTER IV
CORPUS ANALYSIS

Words per utterance

A primary goal of this work is to produce utterances which contain meaningful phrases such as sentences or complete thoughts. From this, one would expect that, on average, the number of words per utterance would be large. The histogram of Figure 5 bears this out but also reveals an interesting trend. From the figure, one sees that over 25% of the utterances are one-word utterances, which explains the relatively short mean utterance duration reported in Table 4. There is a long tail after the one-word utterances which gives a mean value of over 12 words per utterance. The more significant result of this plot is summarized in Table 5. Four words (all affirmations) account for over 70% of the one-word utterances. With only 14 words we can cover 90% of the one-word utterances. It is likely that this information could be used to tune a language model to short utterances such as affirmations or to constrain advanced systems which are able to determine the number of words in the utterance beforehand.

Utterance lengths

We believed that the majority of SWB utterances containing a single phrase would be between seven and eight seconds in length with sufficient silence buffers. The histogram of Figure 6 tells a different story. A large portion of the utterances (close to one-third) are less than two seconds long. This directly correlates with the distribution shown in Figure 5, where the one- and two-word utterances are dominant, and is a fall-out of conversational speech - one-word replies abound. If we remove the one-word utterances from the data, then we find that the distribution of utterance lengths has a mean of close to 6.5 seconds, which is more reasonable for the desired long phrases.

Speech rate

It is well known that speech rate is directly related to one's ability to accurately transcribe speech data. Speech rate is also strongly correlated with speech recognition system performance. In Figure 7 we see that the SWB speech rates actually take on a bimodal distribution. A large percentage of the utterances have speech rates less than one word per second. For the most part, these are one-word utterances where the amount of speech used to calculate the speech rate is masked by the silence buffers on either side. Our quality control scripts flag all utterances with speech rates less than 0.5 words per second or greater than 4.5 words per second.

CHAPTER V
EXPERIMENTS

Preliminary LVCSR Experiments

Of course, the goal of this work is to improve LVCSR performance on SWB. Not surprisingly, monosyllabic words dominate the corpus in terms of word tokens and errors [11]. A natural question to ask is what happens if we update our best hidden Markov models (HMMs) on the new transcriptions that have been described in this thesis. There are a number of practical problems that currently prevent us from doing this experiment as thoroughly as we would like. Most notably, we do not have an in-house capability for generating good lattices, so we cannot evaluate on the retranscribed development test and evaluation databases. Instead, we can simply reestimate models and evaluate on the existing test database. In such a scenario, there is a great potential for language model mismatch to unduly influence the results. In this section, we describe a very preliminary experiment in model reestimation that we believe demonstrates the potential of a corrected corpus.

We have adapted existing acoustic models and trained these models with a training set of 376 resegmented conversations (about 20 hours of speech including silence). The data was evaluated on existing WS'97 lattices so that we could get a quick preview of the potential improvements in WER. Four reestimation passes were made. Laughter was used to update the baseline silence model, and words containing laughter were substituted with their baseform.
The adapted models achieved a 1.9% absolute improvement in WER over the baseline system. The results of this experiment are shown in Table 4. This is a significant improvement that rivals the type of improvement one expects from algorithm advances. We further analyzed this result by sorting errors based on whether or not the error involved a monosyllabic word. We see, not surprisingly, that the 1.9% WER improvement was also observed for monosyllabic words. In other words, the new transcriptions have in fact helped improve the overall performance of the system on monosyllabic words. Equally encouraging is the fact that performance on non-monosyllabic words showed similar improvements (the only negative point is that insertions rose slightly). The significant improvement in recognition of monosyllabic words in this limited experiment is a preview of things to come on the entire database. We expect the gain due to the new transcriptions will be comparable to that achieved over one to two years of algorithm research (based on results cited in the common evaluations).

We have completed resegmentation and transcription corrections of 525 conversations of the SWB training corpus. A summary of the characteristics of this data is displayed in Table 1. The corresponding numbers for a similar subset of the WS'97 data [7] are also provided. It can be seen that our transcriptions and segmentations are significantly more detailed, with explicitly marked silence, disfluencies, laughter, and partial words. The WS'97 set omitted these and partially skipped some conversations. We also observed that of the 100 most frequent words in the resegmented database, 69 are monosyllabic and account for 53% of the total transcription. In the WS'97 data, monosyllabic words constitute 74 of the top 100 words and cover 67% of the transcriptions.

Acoustic Model Adaptation

To estimate the impact of the resegmented training data on recognition performance, we needed to train new acoustic models. We decided to adapt existing acoustic models to this data and evaluate on existing lattices, as this was a faster way of getting a preview of the potential improvements in WER. A word-internal triphone system [7] was used to bootstrap the seed models. The training set consisted of 376 conversations (about 20 hours of speech including silence, or approximately 27,500 utterances) common to the baseline training. Four passes of reestimation were carried out. Since the baseline system lacks a laughter model, laughter was used to update the silence model, while words containing laughter were substituted with their baseform.

Lattice Rescoring

These word-internal acoustic models were used to rescore the WS'97 development test set lattices (the transcriptions of which had already been corrected as described in [7]). The performance of the adapted models (see Table 2) shows a 1.9% absolute improvement over the baseline system. The adapted models also reduce substitutions and deletions, the main contributors to the error rate on SWB evaluations. Of the total errors, 63.3% are attributable to monosyllabic words and 4.7% are due to the various disfluencies. This is significantly lower than the baseline system, which had more than 70% of its errors due to monosyllabic words [7].

CHAPTER VI
FUTURE WORK

CHAPTER VII
CONCLUSIONS

CHAPTER VIII
REFERENCES

[1] D.B. Fry and P. Denes, "The Solution of Some Fundamental Problems in Mechanical Speech Recognition," Language and Speech, vol. 1, pp. 33-58, 1958. [2] J.
Dreyfus-Graf "Phonetograph Und Schwallellen-Quantelung," Proceedings of the Stockholm Speech Communication Seminar, Stockholm, Sweden, Sept. 1962. [3] J.R. Deller Jr., J.G. Proakis and J.H.L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, New York, New York, USA, 1993. [4] L.R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, USA, 1993. [5] D.B. Paul and J.M. Baker, "The Design of the Wall Street Journal-based CSR Corpus," Proceedings of the DARPA Speech and Natural Language Workshop, Harriman, New York, USA, February 1992. [6] D. Pallett, J. Fiscus, A. Martin and M. Przybocki, "1997 Broadcast News Benchmark Test Results: English and Non-English," Proceedings of the Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998. [7] J. Godfrey, E. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone Speech Corpus for Research and Development," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 517-520, San Francisco, California, USA, March 1992. [8] P.J. Price, W.M. Fisher, J. Bernstein and D.S. Pallett, "The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 651-654, New York, New York, USA, April 1988. [9] C.T. Hemphill, J.J. Godfrey and G.R. Doddington, "The ATIS Spoken Language Systems Pilot Corpus," Proceedings of the DARPA Speech and Natural Language Workshop, pp. 96-101, Pittsburgh, Pennsylvania, USA, June 1990. [10] J. Godfrey, E. Holliman and J. McDaniel, "Telephone Speech Corpus for Research and Development," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 517-520, San Francisco, California, USA, March 1992. [11] J. Picone, M.A. Johnson and W.T. Hartwell, "Enhancing Speech Recognition Performance with Echo Cancellation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 529-532, New York, New York, USA, April 1988. [12] A. Ganapathiraju and J. Picone, "A Least-Mean Square Error (LMS) Echo Canceller," http://www.isip.msstate.edu/resources/technology/software/1996/fir_echo_canceller, Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, USA, December 1996. [13] J. Hamaker and J. Picone, "The SWITCHBOARD Frequently Asked Questions (FAQ)," http://www.isip.msstate.edu/resources/technology/projects/current/switchboard/faq, Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, August 1998. [14] B. Wheatley, G. Doddington, C. Hemphill, J. Godfrey, E.C. Holliman, J. McDaniel and D. Fisher, "SWITCHBOARD: A User's Manual," http://www.cis.upenn.edu/~ldc/readme_files/switchbrd.readme.html, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pennsylvania, USA, December 1995. [15] B. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters and G. Zavaliagkos, "Pronunciation Modelling for Conversational Speech Recognition: A Status Report from WS'97," Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, pp. 26-33, Santa Barbara, California, USA, December 1997. [16] M. Finke and A. Waibel, "Flexible Transcription Alignment," Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, pp. 
33-40, Santa Barbara, California, USA, December 1997. [17] A. Martin, J. Fiscus, W. Fisher, D. Pallett and M. Przybocki, "System Descriptions and Performance Summary," presented at the Conversational Speech Recognition Workshop: DARPA Hub-5E Evaluation, Baltimore, Maryland, USA, May 1997. [18] B. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters and G. Zavaliagkos, "Pronunciation Modelling," presented at the 1997 Summer Workshop on Innovative Techniques for Large Vocabulary Conversational Speech Recognition, the Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland, USA, August 1997. [19] J. Hamaker, N. Deshmukh, A. Ganapathiraju and J. Picone, "Improved Monosyllabic Word Modeling on SWITCHBOARD," Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, USA, August 15, 1998. [20] J. Hamaker, Y. Zeng and J. Picone, "Rules and Guidelines for Transcription and Segmentation of the SWITCHBOARD Large Vocabulary Conversational Speech Recognition Corpus," http://www.isip.msstate.edu/resources/technology/projects/current/switchboard/doc/transcription_guidelines, Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, USA, July 1998. [21] N. Deshmukh, J. Hamaker, A. Ganapathiraju, R. Duncan and J. Picone, "An Efficient Tool For Resegmentation and Transcription of Two-Channel Conversational Speech," http://www.isip.msstate.edu/resources/technology/software/1998/swb_segmenter, Institute for Signal and Information Processing, Mississippi State University, Mississippi State, Mississippi, USA, August 1998.

Figure 1. The distribution of the duration of a conversation in SWB.
Figure 2. The distribution of the amount of data per speaker in SWB.
Figure 3. Overview of the SWB segmentation and transcription process.
Figure 4. Screenshot of the segmentation tool.
Figure 5. Overview of the quality control and data verification process.
Figure 6. Flow of quality control scripts.

Transcriber               WER
LDC                       5.4%
ISIP before revisions     3.7%
ISIP after revisions      1.5%
Table 1. Comparison of cross-validation error rates for the LDC and revised transcriptions.

TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
II. HISTORICAL PERSPECTIVE
III. METHODOLOGY
IV. CORPUS ANALYSIS
V. EXPERIMENTS
VI. FUTURE WORK
VII. CONCLUSIONS
VIII. REFERENCES

LIST OF TABLES
TABLE
1. Comparison of cross-validation error rates for the LDC and revised transcriptions

LIST OF FIGURES
FIGURE
1. The distribution of the duration of a conversation in SWB
2. The distribution of the amount of data per speaker in SWB
3. Overview of the SWB segmentation and transcription process
4. Screenshot of the segmentation tool
5. Overview of the quality control and data verification process
6. Flow of quality control scripts