|
A:
acoustic model
|
model used by a speech recognizer to decode spoken language. It
represents numerically, in a form that can be stored on a computer,
how the language sounds when spoken.
|
annotation graph
|
a formal framework for representing linguistic annotations of time series
data. Annotation graphs abstract away from file formats, coding schemes
and user interfaces, providing a logical layer for annotation systems.
|
|
|
B:
Bayes' Rule
|
an equation that decomposes a posterior probability
(e.g., P(W|A)) into the product of a conditional likelihood
(e.g., P(A|W)) and a prior (e.g., P(W)) divided by the probability
of the observed data (e.g., P(A)). From a high-level point of view, this rule provides
a way to combine new data with existing knowledge. It also
provides a theory for learning or training intelligent systems.
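
The decomposition can be illustrated with a minimal Python sketch. The
function name, the two candidate words, and all probability values are
hypothetical and chosen only for illustration; P(A) is assumed to be the
sum of P(A|W)P(W) over the candidate words.

```python
def posterior(likelihoods, priors):
    """Return P(W|A) for each word W, given P(A|W) and P(W)."""
    # Joint probability P(A|W) * P(W) for every candidate word.
    joint = {w: likelihoods[w] * priors[w] for w in priors}
    # P(A) is the sum of the joint over all candidates.
    evidence = sum(joint.values())
    return {w: p / evidence for w, p in joint.items()}

# Hypothetical values for two candidate words.
likelihoods = {"yes": 0.6, "guess": 0.3}  # P(A|W)
priors = {"yes": 0.5, "guess": 0.5}       # P(W)
post = posterior(likelihoods, priors)
```

Because P(A) is the same for every candidate word, a decoder that only
needs the most probable word can skip the division.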
|
best-first search
|
search algorithm that uses an evaluation function, h(N), to indicate the
relative goodness of pursuing a node. It evaluates hypotheses as they
evolve.
|
big-endian byte order
|
byte order where the first byte (at the lowest storage address) in a
sequence is the most significant.
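
A short Python sketch using the standard struct module shows the
difference between the two byte orders for one 32-bit value. The value
0x12345678 is an arbitrary example.

```python
import struct

value = 0x12345678

# ">" requests big-endian: most significant byte at the lowest address.
big = struct.pack(">I", value)
# "<" requests little-endian: least significant byte at the lowest address.
little = struct.pack("<I", value)

print(big.hex())     # 12345678
print(little.hex())  # 78563412
```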
|
biphone
|
context dependent phone model that takes into account either the left
or the right context of a phone.
|
breadth-first search
|
search algorithm that explores all alternatives simultaneously
level-by-level.
|
|
|
C:
cepstral mean subtraction
|
technique addressing distortions. It subtracts the mean cepstral
value from each feature vector and then produces a normalized cepstrum
vector which can better capture the acoustics where recognition occurs.
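
The subtraction step can be sketched in plain Python as follows. The
function name and the two-dimensional example frames are hypothetical;
the mean is assumed to be computed over all frames of one utterance.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean from every feature vector."""
    n = len(frames)
    dim = len(frames[0])
    # Mean cepstral value of each dimension over all frames.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

# Two hypothetical 2-dimensional cepstral vectors.
frames = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_subtraction(frames)
print(normalized)  # [[-1.0, -2.0], [1.0, 2.0]]
```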
|
cepstrum
|
result of transforming the log-spectrum of the speech signal. The
log-spectrum simulates the way humans hear sounds above certain
frequencies.
|
client
|
computer or program that can download files for manipulation, run
applications, or request application-based services from a file
server.
|
clustering
|
model parameters are initialized using sufficient statistics estimated
from different regions of the training data.
|
coarticulation
|
variation in a phoneme due to the
influence of neighboring phonemes.
|
Concurrent Versions System (CVS)
|
open-source network-transparent
version control system.
|
confusion pairs
|
a pair of words identified by the scoring software in which
one word (the first word in the pair) is likely to be
misrecognized as another word (the second word in the
pair). Confusion pairs provide diagnostic information
that can be used to improve the performance of the system.
|
context dependent model
|
phone model that takes into account the phenomenon of
coarticulation because a phone may be voiced differently depending on
the other phones surrounding it.
|
context free grammar
|
grammars whose production rules have a single non-terminal symbol on
the left-hand side. This increases their power and flexibility
beyond regular grammars, but requires a push-down automaton.
|
context independent model
|
phone models that do not consider the influence of surrounding
phonemes on the pronunciation of a given phoneme.
|
context sensitive grammar
|
grammars that allow terminal symbols on both the left-hand side and
the right-hand side of a production rule. They represent the context
of a word more specifically than grammars at lower levels,
but require a more powerful automaton
to recognize sentences in the language.
|
continuous speech recognition
|
recognition of sequences of words that are not separated by pauses when
spoken.
|
cross-validation
|
technique that uses the training database to validate recognition
performance. One common method, called V-fold cross-validation, divides
the database into V equal parts. Each part serves in turn as an
independent test set, leaving the remaining (V-1) parts for training.
This method is often used when the training data is limited.
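
A minimal Python sketch of the V-fold procedure follows. The function
name and the six-item database are hypothetical. The parts are formed
by striding through the list, so they are exactly equal in size only
when V divides the number of items.

```python
def v_fold_splits(items, v):
    """Yield (train, test) pairs, one for each of the v parts."""
    # Part i holds items i, i+v, i+2v, ...
    folds = [items[i::v] for i in range(v)]
    for i in range(v):
        test = folds[i]
        # The remaining v-1 parts form the training set.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Hypothetical database of six utterance identifiers.
database = list(range(6))
splits = list(v_fold_splits(database, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Recognition performance is then typically reported as the average over
the V test sets.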
|
cross-word
|
triphone models that extend across word boundaries.
|
|
|
D:
decision trees
|
binary trees that classify target objects by asking binary questions
in a hierarchical manner.
|
decoder
|
also known as a recognizer or evaluation. This module implements
Bayes' rule and produces the most probable word sequence. It
can be implemented in many different ways including Viterbi beam
search and a stack search.
|
deletion
|
the type of error in which the recognizer's hypothesis does
not contain a word in the reference transcription.
The frequency of deletion and insertion errors can be controlled by an
insertion penalty.
|
depth-first search
|
search algorithm that explores a single path until it reaches its
conclusion.
|
digital signal processing (DSP)
|
analysis of signals in digital form to obtain useful
information. For speech signals, the information extracted includes
attributes needed by the speech recognizer. It is also known as
feature extraction or front-end processing.
|
downsample
|
reducing the sample rate of a digital sound file.
|
|
|
E:
energy
|
an attribute of a signal that corresponds to the magnitude of the
speech signal.
|
energy normalization
|
technique addressing normalization where energy is computed as the log
of the signal energy.
|
enlistment
|
a programmer's copy of the source code development environment
which is used to modify and debug code. Often this is stored in a
user's local environment, and not readily accessible by other software
developers.
|
|
|
F:
Fast Fourier Transform
|
computationally fast method to compute a Fourier Transform.
|
feature
|
attribute of speech needed by the recognizer to differentiate words and
phonemes.
|
feature extraction
|
process of measuring certain attributes of speech needed by the speech
recognizer to differentiate phonemes of a word. It is also known as
front-end processing and signal processing.
|
feature stream
|
the speech signal can be decomposed into a sequence of feature vectors,
typically spaced 10 ms in time, that represent a parameterization of the
salient information in the signal.
|
feature vector
|
list of numerical measurements of speech attributes.
|
finite impulse response
|
a digital filter consisting of a transfer function that has only zeroes,
thereby creating an impulse response that is finite in duration.
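
The finite duration of the impulse response can be seen in a small
Python sketch of direct-form filtering. The function name and the
three coefficients are hypothetical; samples before the start of the
input are assumed to be zero.

```python
def fir_filter(x, b):
    """Compute y[n] = sum over k of b[k] * x[n-k], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, coeff in enumerate(b):
            if n - k >= 0:
                acc += coeff * x[n - k]
        y.append(acc)
    return y

# Hypothetical 3-tap filter driven by a unit impulse.
b = [0.5, 0.3, 0.2]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = fir_filter(impulse, b)
print(response)  # [0.5, 0.3, 0.2, 0.0, 0.0]
```

The response reproduces the coefficients and is zero afterwards, so
its length equals the number of taps.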
|
finite state machine
|
machine giving the probabilities of being in a state at
a particular time in the past, based on direct observation.
|
flat-start
|
simple and effective technique used to initialize an acoustic model. It
computes the global mean and variance from the training data and sets
the model parameters to these values.
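
A flat-start can be sketched in a few lines of Python. The function
names, the dictionary layout of a state, and the example vectors are
hypothetical, and the sketch assumes a diagonal variance computed
per dimension.

```python
def global_mean_variance(vectors):
    """Return the per-dimension mean and variance of all feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n for d in range(dim)]
    return mean, var

def flat_start(vectors, num_states):
    """Give every state the same global mean and variance."""
    mean, var = global_mean_variance(vectors)
    return [{"mean": list(mean), "var": list(var)} for _ in range(num_states)]

# Hypothetical training data: two 2-dimensional feature vectors.
training = [[1.0, 2.0], [3.0, 6.0]]
states = flat_start(training, num_states=3)
print(states[0])  # {'mean': [2.0, 4.0], 'var': [1.0, 4.0]}
```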
|
foundation classes
|
a hierarchy of classes that provide a rich programming environment
loaded with useful classes such as I/O, vectors, matrices, data
structures, and algorithms.
|
Fourier Transform
|
extracts the frequency components of a signal in the time domain.
|
frame
|
interval over which features are measured.
|
frequency domain
|
characteristics of a digital signal pertaining to frequency spectrum.
|
front-end processing
|
algorithms applied to extract features needed by the speech recognizer;
also known as feature extraction and signal processing.
|
fully qualified filename
|
a filename that includes the complete path to the file
(e.g., "/home/jdoe/foo.text" is a fully qualified version
of "foo.text").
|
|
|
G:
GNU General Public License
|
intended to guarantee your freedom to share and change free software.
|
GUI (Graphical User Interface)
|
creates and configures the speech input format, the algorithms for
extracting features, and the output format in a signal flow graph.
|
Gaussian mixture model
|
a statistical model in which the overall probability
distribution is synthesized from a weighted sum of
individual Gaussian distributions. This is a very
powerful form of statistical modeling since arbitrarily
complex distributions can be approximated with
a parametrically controlled amount of precision.
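
The weighted sum can be written as a short Python sketch for the
one-dimensional case. The function names and parameter values are
hypothetical; the weights are assumed to sum to one.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Weighted sum of the individual Gaussian densities at x."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture.
weights = [0.5, 0.5]
means = [-1.0, 1.0]
variances = [1.0, 1.0]
density = gmm_pdf(0.0, weights, means, variances)
```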
|
|
|
H:
header
|
information used to determine the file format and the specific details
about that file.
|
Hidden Markov Models
|
statistical technique yielding the statistical likelihood that a
particular sound was produced given a known word was spoken. The
models are based on a Markov Chain which describes a sequence of
random variables, each conditionally dependent on the
previous variable.
|
HMM trellis
|
physical representation of the hypothesis space as it unfolds in time.
|
|
|
I:
initialization
|
process that sets the model parameters to some initial values
before training the acoustic models. It also facilitates convergence
on a solution more quickly.
|
insertion
|
the type of error in which the recognizer produces a word (or symbol)
hypothesis that does not correspond to any word in the reference
transcription. Insertion errors often occur when the recognizer
outputs two symbols that correspond to one symbol. One of these
symbols will be tagged as a substitution error when non-time-aligned
scoring is used, and the other will be tagged as an insertion error.
Insertion errors are also common when noise is mistakenly recognized
as speech.
|
isolated word recognition
|
occurs when the speaker must pause after each word spoken.
|
|
|
J:
Java Speech Grammar Format
|
The Java(TM) Speech
Grammar Format is a platform-independent, vendor-independent
textual representation of grammars for use in speech
recognition. Grammars are used by speech recognizers
to determine what the recognizer should listen for,
and so describe the utterances a user may say. JSGF
adopts the style and conventions of the Java programming
language in addition to use of traditional grammar
notations.
|
|
|
K:
|
|
L:
language model
|
specifies the order in which words are likely to occur
|
Large Vocabulary Speech Recognition (LVSR)
|
system in English that contains a list of words and phones. It
typically uses phone models because of its vast vocabulary.
|
lexicon
|
a list of the words that can be recognized by a speech recognition
system along with their pronunciation, or expansion into some
fundamental set of units corresponding to the acoustic models
|
linear prediction
|
a mathematical operation where future values of a discrete-time signal
are estimated as a linear function of previous samples
|
little-endian byte order
|
byte order where the first byte (at the lowest storage address) in a
sequence is the least significant.
|
log-spectrum
|
used to simulate the way humans hear sounds above certain frequencies.
|
low pass filter
|
filter that removes high frequency signals and allows low frequency
signals to pass.
|
|
|
M:
Mel-Frequency Cepstrum Coefficients (MFCC)
|
features obtained by warping the frequency components extracted by the
Fourier transform onto the mel scale, which approximates human hearing,
and then computing the cepstrum.
|
mixture splitting
|
process of splitting an existing mixture into N other mixtures based
on some technique or algorithm (clustering, variance splitting, etc.).
|
|
|
N:
N-gram
|
sequence of N words. An N-gram language model uses a finite number
(N-1) of previous words to predict a set of next words.
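
For N = 2 (a bigram), the prediction can be sketched in Python by
counting word pairs. The function name and the example sentence are
hypothetical, and the probabilities are plain relative frequencies
with no smoothing.

```python
from collections import Counter

def bigram_probabilities(words):
    """Estimate P(next | previous) from relative frequencies."""
    # How often each word appears as the previous word of a pair.
    history_counts = Counter(words[:-1])
    # How often each (previous, next) pair appears.
    pair_counts = Counter(zip(words[:-1], words[1:]))
    return {pair: count / history_counts[pair[0]]
            for pair, count in pair_counts.items()}

# Hypothetical training text.
words = "the cat sat on the mat".split()
probs = bigram_probabilities(words)
print(probs[("the", "cat")])  # 0.5
```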
|
NIST
|
National Institute of Standards and Technology.
An organization that conducts third-party evaluation of
human language technology. The
NIST web site (http://www.nist.gov/speech/)
contains many useful resources including a scoring package called
SCLITE
that provides an industry-standard means for scoring speech
recognition results.
|
natural byte order
|
byte order native to a certain system.
|
natural language processing
|
provides a source of knowledge needed by the recognizer or language
model.
|
Network
|
computes next words from a probable path through a finite state network.
It decodes the set of next words.
|
|
|
O:
|
|
P:
parse
|
refers to the problem of determining if a given sequence could have been
generated from a given state machine.
|
phone model
|
model that obtains certain phonemes in order to create a complete model
of a word.
|
phoneme
|
any of the small units of speech sound in a language that help to
distinguish one word from another (e.g., the phoneme aa is the a
sound in father, and the phoneme jh is the j sound in joy).
|
pipe
|
SunOS allows you to send the output of one program to another
program. Use the "|" (pipe) character to do this. (Ex. command1 |
command2 sends the output of the program command1 to command2.)
|
pruning
|
process that removes unlikely paths from consideration and saves
resource usage in both memory and time.
|
|
|
Q:
|
|
R:
RAW file
|
binary audio data in big-endian or
little-endian byte order.
|
Revision Control System (RCS)
|
manages multiple
revisions of files.
|
recipe
|
single entity that stores information from each component
within a signal flow graph.
|
recognition error
|
an error of a speech recognition system.
|
reestimation
|
phase in training known as the refinement process that begins after the
acoustic models have been seeded with initial values. It applies
special algorithms to reestimate the model parameters until
convergence occurs.
|
reference transcription
|
To evaluate a speech recognition system, the output
hypothesis must be compared to the "answer", known as
the reference transcription.
|
regular expressions
|
a language that is used to describe patterns to be matched when
searching over large repositories of data. Regular expressions
are the backbone of the UNIX operating system, and supported by
tools such as egrep and bash. Several publicly available portable
libraries exist to support such interfaces.
|
regular grammar
|
grammars that require every production rule to contain at least one
terminal on the right-hand side.
|
repository
|
an archive on our server that is used by the configuration
management software to maintain all versions of our software.
Specific versions of the software, including the most recent version
under development, can be retrieved using the configuration management
software.
|
rsh
|
command that lets you execute another command on a remote system
and get the output back to your local system.
|
|
|
S:
sample
|
a digitized audio segment taken from an original recording and
inserted, often repetitively, in a new digital recording.
|
scalar classes
|
the classes that represent the fundamental math building blocks of the
IFC environment. These classes perform the same functions as
their counterparts in standard C/C++ programming languages
(int, short, long, float, double, ...), but add the ability to
read and write themselves to an object-oriented file system
known as Signal Object File (Sof).
|
scoring
|
the process by which a recognition system's output is
compared to reference transcriptions containing the "correct"
answer. Errors are tabulated and presented in a format
that can help a user understand the deficiencies of the system.
NIST distributes a scoring package that is widely used
within the community.
|
server
|
computer that provides client stations with access to files and printers
as shared resources to a computer network.
|
signal flow graph
|
graphical representation of an input source receiving a signal, passing
the signal to algorithms for processing, and producing an output with
data from the signal.
|
signal modeling
|
process of representing a signal based on some defined model that is
useful to the system
|
smoothing
|
SRI tool used during training to provide users a way of generating
broader language models. It allows all word sequences to occur with
some probability.
|
Sof
|
(Signal Object File)
ISIP's internal format for storing any type
of C++ data. The file is essentially an indexing scheme that
keeps track of the locations of all objects in the file.
Files can be stored in a binary or text format.
|
spectrogram
|
visual display of vocal frequencies measured over some window of time.
|
speech file
|
any sound file containing human speech
|
speech recognizer
|
computer program that attempts to decode digital speech.
|
stack search
|
A depth-first approach to search in speech recognition.
Extremely useful for N-best hypothesis generation.
|
state-tying
|
occurs when phones are in similar states. The states are tied together
because of the sparsity of training data. It reduces system complexity
and allows synthesis of unseen models.
|
substitution
|
the type of error in which a word in the reference transcription
is replaced by an incorrect word in the recognizer's hypothesis.
Technically, the start and stop times of the hypothesis must overlap
with the reference string for an error to be counted as a substitution.
|
|
|
T:
tar file
|
an archive file format that is used to store many files and directories
in an efficient and portable manner. The acronym tar represents
tape archive file, and dates back to the use of magnetic
tape. Today, tar is one of the most common mechanisms used to transfer
groups of files and directories from one machine to another.
|
time domain
|
characteristics of a digital signal pertaining to its change over time.
|
time-synchronous
|
refers to an approach in the speech recognition search process in
which all active hypotheses are extended one frame at a time as
each new feature vector arrives.
|
training
|
process that seeks to converge on a solution yielding the most likely
sequence of vectors for a given acoustic unit.
|
transcription error
|
may be the error of a human transcriber or the error of a computer
transcriber.
|
triphone
|
context dependent phone model that takes into account both the left
and the right context of a phone.
|
|
|
U:
|
|
V:
variance splitting
|
technique for successively splitting each mixture component in the
distribution until the desired number of mixture components has been
created. It will preserve the variance of the distribution while
shifting the mean of the distribution by some factor of standard
deviation.
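
One possible reading of this technique is sketched below in Python for
one-dimensional components. The function names, the (weight, mean,
variance) tuple layout, the factor of 0.2, and the halving of the
weight are all assumptions made for illustration. The sketch also
omits any reestimation between rounds, so repeated rounds with a
constant factor can produce components with identical means.

```python
import math

def split_component(weight, mean, var, factor=0.2):
    """Split one component into two with shifted means and the same variance."""
    offset = factor * math.sqrt(var)  # shift by a factor of the standard deviation
    return [
        (weight / 2.0, mean - offset, var),
        (weight / 2.0, mean + offset, var),
    ]

def variance_split(components, target, factor=0.2):
    """Double the number of components until at least `target` exist."""
    while len(components) < target:
        components = [new for comp in components
                      for new in split_component(*comp, factor=factor)]
    return components

# Hypothetical starting point: one component with variance 4.0.
components = variance_split([(1.0, 0.0, 4.0)], target=4)
```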
|
Viterbi beam search
|
a suboptimal search algorithm based on the principle of dynamic
programming in which the most promising hypotheses are maintained,
and other hypotheses are discarded. The term "beam" is used
because the analogy can be made with how you search around a dark
room with a flashlight. The name Viterbi is used because this
search approach is similar to Viterbi decoding, which is a special
case of dynamic programming pioneered in communication
systems. A Viterbi beam search is essentially a breadth-first
suboptimal search in which only the most promising candidates
are pursued. Several thresholds on the overall likelihoods of
the hypotheses are applied to select the most promising candidates.
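
The idea can be sketched in Python for a small discrete model. All
names, the two states, and the probability tables are hypothetical,
and the sketch applies a single threshold (the beam width, in the log
domain) rather than several.

```python
import math

def logs(probs):
    """Convert a table of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

def viterbi_beam(observations, states, log_init, log_trans, log_emit, beam):
    """Return (log score, state path) of the best surviving hypothesis."""
    # active maps a state to (log score, best path ending in that state).
    active = {s: (log_init[s] + log_emit[s][observations[0]], [s])
              for s in states}
    for obs in observations[1:]:
        best = max(score for score, _ in active.values())
        # Pruning: discard hypotheses more than `beam` below the best score.
        active = {s: h for s, h in active.items() if h[0] >= best - beam}
        extended = {}
        for prev, (score, path) in active.items():
            for s in states:
                new_score = score + log_trans[prev][s] + log_emit[s][obs]
                # Dynamic programming: keep only the best path into each state.
                if s not in extended or new_score > extended[s][0]:
                    extended[s] = (new_score, path + [s])
        active = extended
    return max(active.values(), key=lambda h: h[0])

# Hypothetical two-state model with symbols "x" and "y".
states = ["A", "B"]
log_init = logs({"A": 0.5, "B": 0.5})
log_trans = {s: logs({"A": 0.5, "B": 0.5}) for s in states}
log_emit = {"A": logs({"x": 0.9, "y": 0.1}),
            "B": logs({"x": 0.1, "y": 0.9})}
score, path = viterbi_beam(["x", "x", "y"], states,
                           log_init, log_trans, log_emit, beam=2.0)
print(path)  # ['A', 'A', 'B']
```

A narrower beam saves more computation but raises the risk of pruning
the hypothesis that would have scored best overall, which is why the
search is suboptimal.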
|
|
|
W:
WAV
|
Microsoft WAV format, a format used in the Microsoft Windows
operating system to store audio files. Support for WAV is provided
through Silicon Graphics'
Audio File Library.
|
waveform
|
a mathematical and visual representation of an analog wave, usually a
graph obtained by plotting a characteristic of the wave against time.
|
window
|
a collection of samples surrounding a frame over which the feature
measurements are taken, conveying a smoother representation of the
speech data.
|
Word Error Rate (WER)
|
a measure of the accuracy of a speech recognition system that tabulates
three types of errors: substitutions, deletions and insertions. This is
typically computed using a standard set of tools provided by NIST.
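
The tabulation can be sketched in Python with a standard edit-distance
alignment. This is not the NIST tool: the function name and example
sentences are hypothetical, the alignment ignores timing information,
and the rate is assumed to be the total number of errors divided by
the number of words in the reference.

```python
def word_errors(reference, hypothesis):
    """Count substitutions, deletions, and insertions by minimum edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total, subs, dels, ins) for aligning ref[:i] with hyp[:j].
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                match_or_sub = (t, s, d, n)
            else:
                match_or_sub = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(match_or_sub, deletion, insertion,
                             key=lambda c: c[0])
    total, subs, dels, inss = cost[len(ref)][len(hyp)]
    return {"substitutions": subs, "deletions": dels,
            "insertions": inss, "wer": total / len(ref)}

# Hypothetical reference and hypothesis.
result = word_errors("the cat sat on the mat", "the cat sit on mat")
print(result["substitutions"], result["deletions"], result["insertions"])  # 1 1 0
```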
|
word-internal
|
a triphone model that remains within word boundaries.
|
word model
|
a model for each of the phonemes produced for an entire word.
|
|
|
X:
|
|
Y:
|
|
Z:
|