In this tutorial, I will explain a novel capability of the
production system that allows multiple utterances per file to be
processed. This is useful because it can significantly reduce
processing time. Some databases, such as SWITCHBOARD, consist of
long conversations, often 5 to 10 minutes in duration. In the
past, we have segmented such long data into short utterances,
typically 5 to 10 seconds in length, and then processed these
individual files. We call this a one utterance per file format.
This format is inefficient for standard computer processing
because the operating system will spend more time opening and
closing the file than it will spend processing the data in the
file. As computers have increased in speed, processing times are
often 0.1 xRT (0.1 seconds of processing per second of speech
data). The total time it takes to process 500 hours of data in a
one utterance per file format is often limited by the I/O time,
particularly the time it takes for the disk to seek to the data,
open the file, etc.
To overcome this, we have developed a facility in which a single
file can contain multiple utterances. Hence, a conversation need
not be separated into many small files. The utterance segmentation
and transcription information is provided to the recognizer in the
form of a transcription database that contains the start and stop
times of each utterance as well as the corresponding transcription
for that segment.
This database is described in much more detail in our
online speech recognition tutorial.
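To make the I/O argument concrete, the rough Python arithmetic below
counts the files involved. The 7.5 second average utterance length is
an assumption used only for illustration:

  # Rough numbers behind the I/O argument above. The 7.5 second average
  # utterance length is an assumed value, used only for illustration.
  hours = 500
  xrt = 0.1                      # 0.1 sec of processing per sec of speech
  avg_utt = 7.5                  # assumed average utterance length (sec)

  total_speech_sec = hours * 3600
  num_files = total_speech_sec / avg_utt       # one utterance per file
  compute_per_utt = avg_utt * xrt              # ~0.75 sec of work per file

  print(f"files to open, seek, and close: {num_files:,.0f}")
  print(f"computation per file: {compute_per_utt:.2f} sec")
  print(f"total computation: {total_speech_sec * xrt / 3600:.1f} hours")

With only about 0.75 seconds of computation per file, opening, seeking,
and closing roughly 240,000 separate files can easily account for a
large share of the total run time.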
1. An Overview of the Speech Recognition Process
A basic block diagram of the speech recognition process is shown
to the right. There are two main processes involved in decoding
a speech file:
- Feature Extraction:
Only certain attributes or features of a person's speech are
helpful for decoding sound units, called phones. These features
are extracted using a combination of spectral and temporal
measurements. See feature extraction for more information on how
to convert an audio file to a feature file. A minimal sketch of
this framing process appears after this list.
- Recognition:
The process of finding the most probable sequence of symbols,
typically words, for an utterance is known as recognition
(or decoding). For more information on recognition, see
recognition.
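The short Python sketch below illustrates the framing idea behind
feature extraction: the signal is cut into overlapping frames and a
simple spectral measurement is computed for each one. This is not the
ISIP front end, just a generic numpy illustration of the concept:

  import numpy as np

  def extract_features(signal, sample_rate=8000,
                       frame_ms=25.0, shift_ms=10.0):
      # Cut the signal into overlapping, windowed frames and compute a
      # simple spectral measurement per frame (a stand-in for a real
      # feature vector such as MFCCs).
      frame_len = int(sample_rate * frame_ms / 1000)
      shift = int(sample_rate * shift_ms / 1000)
      features = []
      for start in range(0, len(signal) - frame_len + 1, shift):
          frame = signal[start:start + frame_len] * np.hamming(frame_len)
          spectrum = np.abs(np.fft.rfft(frame))
          features.append(np.log(spectrum + 1e-10))
      return np.array(features)    # one feature vector per frame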
2. Multiple Utterances Per File
A file containing multiple utterances is shown to the right. As
mentioned above, some data, such as recordings of telephone
conversations, naturally lends itself to such a file organization.
Other data can be put in this format to speed up processing time
by decreasing I/O wait times. The features are calculated
separately for each utterance using the timing information in the
transcription database, as sketched below. The recognizer can read
sampled audio data or feature files in this multiple utterance per
file format.
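A minimal Python sketch of this per-utterance processing follows. The
list of (start, stop, transcription) segments shown here is a
hypothetical plain-text stand-in for the timing information the
transcription database provides (the actual database is stored in
ISIP's Sof format), and the sketch reuses the extract_features
function from Section 1:

  def utterance_features(signal, sample_rate, segments):
      # segments: list of (start_sec, stop_sec, transcription) tuples,
      # i.e. the timing information a transcription database provides.
      results = []
      for start_sec, stop_sec, text in segments:
          start = int(start_sec * sample_rate)
          stop = int(stop_sec * sample_rate)
          utt = signal[start:stop]                     # one utterance
          feats = extract_features(utt, sample_rate)   # per-utterance features
          results.append((text, feats))
      return results

  # hypothetical segments for one long recording
  segments = [(0.00, 4.37, "hello how are you"),
              (4.37, 9.12, "fine thanks and you")]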
3. Experimental Setup
Below, we lead you through a simple example of how to execute
an experiment using this capability. The main components are:
- Audio File: A file with multiple utterances.
- Audio Database: This database manages a set of audio
files using a simple database indexing scheme in which each
file is identified by a key (referred to as an audio
id). To learn more about the audio database and how to create
one, see our online tutorial. A simplified sketch of how these
keys relate to the transcriptions appears below.
- Transcription Database: This database manages the
transcriptions for each file. It plays a key role in feature
extraction for multiple-utterance files since it contains the
timing information (start and stop times) for each utterance.
To learn more about the transcription database and how to create
one, see our online tutorial.
- Acoustic Models: These are used by the recognizer
to perform statistical modeling of the input signal.
To learn more about acoustic models, see our
online tutorial.
- Language Models: While the acoustic models built from
the extracted features enable the recognizer to decode phonemes
that comprise words, the language models specify which sequences
of words are likely to occur. To learn more
about language models, see our
online tutorial.
- Recipe Files: A recipe is a single entity that stores
information about how we convert a speech signal into a
sequence of vectors in a signal flow graph. To learn how to
create a recipe, see our
online tutorial.
All of this information is encapsulated in a parameter file.
To learn more about the configuration of the recognizer,
I recommend that you work through the exercises available in our
online tutorial.
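Before running the commands below, it may help to picture how the audio
and transcription databases fit together. The Python dictionaries in
this sketch are purely illustrative (the keys, file paths, and segments
are made up, and the real databases are stored in ISIP's Sof format):
each audio id maps to one audio file and to the timed, transcribed
segments it contains.

  audio_db = {
      "sw2001": "data/sw2001.raw",
      "sw2005": "data/sw2005.raw",
  }

  trans_db = {
      "sw2001": [(0.00, 4.37, "hello how are you"),
                 (4.37, 9.12, "fine thanks and you")],
      "sw2005": [(0.00, 6.05, "did you see the game last night")],
  }

  for key, path in audio_db.items():
      for start, stop, text in trans_db[key]:
          print(f'{key} ({path}): {start:.2f}-{stop:.2f} sec "{text}"')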
Here are some sample commands that can be used to execute the
steps described above:
- To create an audio and transcription database:
isip_make_db -db both -audio audio_list.text \
  -trans trans_list.text -level word \
  -name TIDigits -type text audio_db.sof trans_db.sof
- To run recognition:
isip_recognize -parameter_file params/params_decode.sof \
  -list lists/identifiers.sof
In the future, we plan to release more experimental setups that
make use of this feature.