Integrating Prosody, Speech Recognition, and Parsing in Spoken-Language Information Retrieval

About

One of the central goals of this project is to integrate natural language parsing, which has been largely developed with respect to written texts, with speech recognition. We have demonstrated that parsing technology can be successfully applied to speech transcripts, and we have shown that the kinds of syntactic structures posited by a statistical parser can form the basis for a high-performance language model. These results suggest that a combined speech recognition/parsing system should perform extremely well. Substantial engineering and scientific work remains, however, before that integration is achieved.

We are currently investigating what the interface between the speech recognition and parsing components should be in a combined system. The basic data structures in each component (lattices in speech recognition, charts in parsing) are in principle quite compatible; in theory, at least, one could imagine running a parser in parallel with an acoustic model. This is a bold and attractive architecture, but we suspect that at the current stage it is impractical: the number of word hypotheses would simply overwhelm the parser. We are thus investigating ways of pruning the hypothesis space, perhaps by using a standard trigram language model, and of compacting the set of hypotheses, perhaps by using sausages (confusion networks) instead of lattices; probably some combination of the two will turn out to be viable.
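
To make the pruning idea concrete, the following is a minimal sketch of beam pruning over a word lattice using trigram scores. It is an illustration only, not a description of the project's system: the lattice, the log-probabilities, the back-off constant, and the names `LATTICE`, `TRIGRAM_LOGP`, `UNSEEN_LOGP`, `trigram_logp`, and `prune_lattice` are all hypothetical, and the flat penalty for unseen trigrams stands in for the smoothing a real language model would use.

```python
from collections import defaultdict

# Hypothetical toy lattice: node -> list of (next_node, word, acoustic_log_prob).
# Node ids are assumed to be in topological order; 0 is the start, 3 is the end.
LATTICE = {
    0: [(1, "recognize", -1.0), (1, "wreck", -1.2)],
    1: [(2, "speech", -0.8), (2, "a", -0.9)],
    2: [(3, "now", -0.5), (3, "nice", -1.5)],
}

# Hypothetical trigram log-probabilities; any unseen trigram gets a flat penalty.
TRIGRAM_LOGP = {
    ("<s>", "<s>", "recognize"): -0.5,
    ("<s>", "recognize", "speech"): -0.3,
    ("recognize", "speech", "now"): -0.4,
}
UNSEEN_LOGP = -5.0


def trigram_logp(w1, w2, w3):
    """Log-probability of w3 given the two preceding words (toy lookup)."""
    return TRIGRAM_LOGP.get((w1, w2, w3), UNSEEN_LOGP)


def prune_lattice(lattice, start, end, beam_size):
    """Return up to beam_size best (score, words) paths from start to end."""
    # beams[node] holds partial hypotheses (score, words) that end at node.
    beams = defaultdict(list)
    beams[start] = [(0.0, ("<s>", "<s>"))]
    for node in sorted(lattice):
        # Keep only the best beam_size hypotheses before extending them.
        survivors = sorted(beams[node], reverse=True)[:beam_size]
        for score, words in survivors:
            for next_node, word, acoustic in lattice[node]:
                lm = trigram_logp(words[-2], words[-1], word)
                beams[next_node].append((score + acoustic + lm, words + (word,)))
    final = sorted(beams[end], reverse=True)[:beam_size]
    # Drop the two sentence-start padding symbols from each surviving path.
    return [(score, list(words[2:])) for score, words in final]


if __name__ == "__main__":
    for score, words in prune_lattice(LATTICE, start=0, end=3, beam_size=2):
        print(round(score, 2), " ".join(words))
```

In a design along these lines, only the hypotheses that survive the beam would be handed to the parser, so `beam_size` trades parser load against the risk of discarding the correct word sequence early.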

This material is based upon work supported by the National Science Foundation under Grant No. IIS0095940. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.