
An excellent resource for definitions of common terms in computer science is the NIST Dictionary of Algorithms and Data Structures, a very comprehensive technical resource. Below, we provide definitions of terms that are somewhat unique to the speech research community and are used frequently in this tutorial.

A:
acoustic model model used by a speech recognizer to represent numerically how language sounds when spoken, in a form that can be stored on a computer and used to decode speech produced by a person.
annotation graph a formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems.
 
B:
Bayes' Rule an equation that decomposes a posterior probability (e.g., P(W|A)) into the product of a conditional likelihood (e.g., P(A|W)) and a prior (e.g., P(W)), divided by the probability of the evidence (e.g., P(A)); see the worked equation at the end of this section. From a high-level point of view, this rule provides a way to combine new data with existing knowledge. It also provides a theory for learning or training intelligent systems.
best-first search search algorithm that uses an evaluation function, h(N), to indicate the relative goodness of pursuing a node. It evaluates hypotheses as they evolve.
big-endian byte order byte order where the first byte (at the lowest storage address) in a sequence is the most significant.
biphone a context dependent phone model that models either the left or the right neighboring context.
breadth-first search search algorithm that explores all alternatives simultaneously level-by-level.
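As a worked illustration of Bayes' Rule as it is used in speech recognition (a sketch, with W denoting a word sequence and A the observed acoustic evidence), the posterior decomposes as

    P(W \mid A) = \frac{P(A \mid W)\, P(W)}{P(A)}

and, since P(A) does not depend on W, the decoder selects \hat{W} = \arg\max_{W} P(A \mid W)\, P(W).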
 
C:
cepstral mean subtraction technique for reducing channel distortions. The mean cepstral vector, computed over an utterance, is subtracted from each feature vector, producing normalized cepstral vectors that better capture the acoustics relevant to recognition; a code sketch follows at the end of this section.
cepstrum a transform of the log-spectrum of the speech signal (formally, the inverse Fourier transform of the log-spectrum) that compactly represents the spectral envelope; in the mel-frequency cepstrum, the frequency axis is also warped to approximate the resolution of human hearing.
client computer or program that can download files for manipulation, run applications, or request application-based services from a file server.
clustering technique in which model parameters are initialized using sufficient statistics estimated from different regions of the training data.
coarticulation variation in a phoneme due to the influence of neighboring phonemes.
Concurrent Versions System (CVS) open-source network-transparent version control system.
confusion pairs a pair of words identified by the scoring software as likely to be confused: the first word in the pair is misrecognized as the second word in the pair. Confusion pairs represent diagnostic information that can be used to improve the performance of the system.
context dependent model phone model that takes into account the phenomenon of coarticulation because a phone may be voiced differently depending on the other phones surrounding it.
context free grammar grammars whose production rules have a single non-terminal symbol on the left-hand side; they have more power and flexibility than regular grammars, but require a push-down automaton to recognize sentences in the language.
context independent model phone models that do not consider the influence of surrounding phonemes on the pronunciation of a given phoneme.
context sensitive grammar grammars whose production rules may have both terminal and non-terminal symbols on the left-hand side, allowing the context of a symbol to be represented more specifically than in less powerful grammars, but requiring a more powerful automaton to recognize sentences in the language.
continuous speech recognition recognition of sequences of words that are not separated by pauses when spoken.
cross-validation allows using the training database to validate recognition performance. One common method is called V-fold which divides the database into V equal parts. Each part serves as an independent test set, leaving the remaining (V-1) parts for training. This method is often used when the training data is limited.
cross-word triphone models that extend across word boundaries.
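The following is a minimal sketch of cepstral mean subtraction in C++, assuming the features for one utterance are stored as a vector of equal-length cepstral vectors; the function name and container types are illustrative only and are not part of any particular toolkit.

    #include <vector>

    // Subtract the per-utterance mean from each cepstral vector in place.
    void cepstralMeanSubtraction(std::vector< std::vector<double> >& feats) {
      if (feats.empty()) return;
      const size_t dim = feats[0].size();
      std::vector<double> mean(dim, 0.0);

      // accumulate the mean cepstral vector over all frames
      for (size_t t = 0; t < feats.size(); ++t)
        for (size_t i = 0; i < dim; ++i)
          mean[i] += feats[t][i];
      for (size_t i = 0; i < dim; ++i)
        mean[i] /= static_cast<double>(feats.size());

      // subtract the mean from every frame
      for (size_t t = 0; t < feats.size(); ++t)
        for (size_t i = 0; i < dim; ++i)
          feats[t][i] -= mean[i];
    }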
 
D:
decision trees a binary tree used to classify target objects by asking binary questions in a hierarchical manner.
decoder also known as a recognizer or the evaluation module. This module implements Bayes' rule and produces the most probable word sequence. It can be implemented in many different ways, including a Viterbi beam search and a stack search.
deletion the type of error in which the recognizer's hypothesis does not contain a word present in the reference transcription. The frequency of deletion and insertion errors can be controlled by an insertion penalty.
depth-first search search algorithm that explores a single path until it reaches its conclusion.
digital signal processing (DSP) analysis of signals in digital form to obtain useful information. For speech signals, the information extracted includes attributes needed by the speech recognizer. It is also known as feature extraction or front-end processing.
downsample reducing the sample rate of a digital sound file.
 
E:
energy an attribute of a signal that corresponds to the magnitude of the speech signal.
energy normalization technique in which the frame energy, computed as the log of the signal energy, is normalized across an utterance to reduce the effect of variations in recording level.
enlistment a programmer's copy of the source code development environment which is used to modify and debug code. Often this is stored in a user's local environment, and not readily accessible by other software developers.
 
F:
Fast Fourier Transform a computationally efficient algorithm for computing the Discrete Fourier Transform.
feature attribute of speech needed by the recognizer to differentiate words and phonemes.
feature extraction process of measuring certain attributes of speech needed by the speech recognizer to differentiate phonemes of a word. It is also known as front-end processing and signal processing.
feature stream the speech signal can be decomposed into a sequence of feature vectors, typically spaced 10 ms in time, that represent a parameterization of the salient information in the signal.
feature vector list of numerical measurements of speech attributes.
finite impulse response a digital filter whose transfer function has only zeroes, making its impulse response finite in duration.
finite state machine a machine consisting of a finite set of states and the transitions between them; unlike in a hidden Markov model, the state occupied at a particular time can be determined by direct observation.
flat-start simple and effective technique used to initialize an acoustic model. It computes the global mean and variance from the training data and sets the model parameters to these values.
foundation classes a hierarchy of classes that provide a rich programming environment loaded with useful classes such as I/O, vectors, matrices, data structures, and algorithms.
Fourier Transform a transform that extracts the frequency components of a time-domain signal; the defining equation (in its discrete form) appears at the end of this section.
frame interval over which features are measured.
frequency domain characteristics of a digital signal pertaining to frequency spectrum.
front-end processing algorithms applied to extract features needed by the speech recognizer; also known as feature extraction and signal processing.
fully qualified filename a filename that includes the complete path to the file (e.g., "/home/jdoe/foo.text" is a fully qualified version of "foo.text").
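For reference, the Discrete Fourier Transform of a frame of N time-domain samples x[n] (the quantity most commonly computed via the Fast Fourier Transform) is

    X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1.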
 
G:
GNU General Public License intended to guarantee your freedom to share and change free software.
GUI Graphical User Interface creates and configures the speech input format, the algorithms for extracting features, and the output format in a signal flow graph.
Gaussian mixture model a statistical model in which the overall probability distribution is synthesized from a weighted sum of individual Gaussian distributions. This is a very powerful form of statistical modeling since arbitrarily complex distributions can be approximated with a parametrically controlled amount of precision.
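The Gaussian mixture model density for a feature vector x with M mixture components can be written as

    p(x) = \sum_{m=1}^{M} w_m\, \mathcal{N}(x; \mu_m, \Sigma_m), \qquad \sum_{m=1}^{M} w_m = 1,

where the w_m are the mixture weights and each component is a Gaussian with mean \mu_m and covariance \Sigma_m.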
 
H:
header information used to determine the file format and the specific details about that file.
Hidden Markov Models statistical technique yielding the statistical likelihood that a particular sound was produced given a known word was spoken. The models are based on a Markov Chain which describes a sequence of random variables, each conditionally dependent on the previous variable.
HMM trellis physical representation of the hypothesis space as it unfolds in time.
 
I:
initialization process that sets the model parameters to some initial values before training the acoustic models. It also facilitates convergence on a solution more quickly.
insertion the type of error in which the recognizer produces a word (or symbol) hypothesis that does not correspond to any word in the reference transcription. Insertion errors often occur when the recognizer outputs two symbols that correspond to one symbol. One of these symbols will be tagged as a substitution error when non-time-aligned scoring is used, and the other will be tagged as an insertion error. Insertion errors are also common when noise is mistakenly recognized as speech.
isolated word recognition recognition of speech in which the speaker must pause after each word spoken.
 
J:
Java Speech Grammar Format The Java Speech Grammar Format (JSGF) is a platform-independent, vendor-independent textual representation of grammars for use in speech recognition. Grammars are used by speech recognizers to determine what the recognizer should listen for, and so describe the utterances a user may say. JSGF adopts the style and conventions of the Java programming language in addition to the use of traditional grammar notations.
 
K:
 
L:
language model specifies the order in which words are likely to occur.
Large Vocabulary Speech Recognition (LVSR) a recognition system with a large vocabulary (typically thousands of words or more), defined by a list of words and their phone-level pronunciations. It typically uses phone models rather than word models because of the size of the vocabulary.
lexicon a list of the words that can be recognized by a speech recognition system along with their pronunciation, or expansion into some fundamental set of units corresponding to the acoustic models.
linear prediction a mathematical operation where future values of a discrete-time signal are estimated as a linear function of previous samples; see the equation at the end of this section.
little-endian byte order byte order where the first byte (at the lowest storage address) in a sequence is the least significant.
log-spectrum the logarithm of the magnitude spectrum of a signal; the logarithmic scale approximates the way humans perceive differences in sound intensity.
low pass filter filter that removes high frequency signals and allows low frequency signals to pass.
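The linear prediction model referred to above estimates each sample of the signal as a weighted sum of the previous p samples,

    \hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k],

where the predictor coefficients a_k are chosen to minimize the mean squared prediction error.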
 
M:
Mel-Frequency Cepstrum Coefficients (MFCC) features computed by warping the frequency components extracted by the Fourier transform onto the mel scale, taking the logarithm, and applying a cosine transform; the most widely used feature representation in speech recognition.
mixture splitting process of splitting an existing mixture into N other mixtures based on some technique or algorithm (clustering, variance splitting, etc.).
 
N:
N-gram a language model that uses a finite number (N-1) of previous words to predict the next word; see the trigram example at the end of this section.
NIST National Institute of Standards and Technology. An organization that conducts third-party evaluations of human language technology. The NIST web site (http://www.nist.gov/speech/) contains many useful resources, including a scoring package called SCLITE that provides an industry-standard means for scoring speech recognition results.
natural byte order byte order native to a certain system.
natural language processing the processing of human language by computers; in speech recognition, it provides a source of knowledge needed by the recognizer and the language model.
network a finite state network through which the decoder follows probable paths to compute the set of next words.
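As an example of an N-gram, a trigram language model (N = 3) approximates the probability of a word sequence using only the two preceding words at each position:

    P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}).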
 
O:
 
P:
parse refers to the problem of determining if a given sequence could have been generated from a given state machine.
phone model a model of an individual phoneme; phone models are concatenated to create a complete model of a word.
phoneme any of the small units of speech sound in a language that serve to distinguish one word from another. (Ex. The phoneme, aa, is the a sound in father, and the phoneme, jh, is the j sound in joy.)
pipe SunOS allows you to send the output of one program to another program using the "|" (pipe) character. (Ex. command1 | command2 sends the output of the program command1 to command2.)
pruning process that removes unlikely paths from consideration and saves resource usage in both memory and time.
 
Q:
 
R:
RAW file a file containing binary audio data, stored in either big-endian or little-endian byte order.
Revision Control System (RCS) manages multiple revisions of files.
recipe single entity that stores information from each component within a signal flow graph.
recognition error an error made by a speech recognition system: a substitution, deletion, or insertion.
reestimation phase in training known as the refinement process that begins after the acoustic models have been seeded with initial values. It applies special algorithms to reestimate the model parameters until convergence occurs.
reference transcription To evaluate a speech recognition system, the output hypothesis must be compared to the "answer", known as the reference transcription.
regular expressions a language that is used to describe patterns to be matched when searching over large repositories of data. Regular expressions are the backbone of the UNIX operating system, and supported by tools such as egrep and bash. Several publicly available portable libraries exist to support such interfaces.
regular grammar grammars in which every production rule has a single non-terminal on the left-hand side and a right-hand side consisting of a terminal optionally followed by a single non-terminal; sentences in such a grammar can be recognized by a finite state machine.
repository an archive on our server that is used by the configuration management software to maintain all versions of our software. Specific versions of the software, including the most recent version under development, can be retrieved using the configuration management software.
rsh command that lets you execute another command on a remote system and get the output back to your local system.
 
S:
sample a single value of a digitized signal, representing the amplitude of the waveform at one point in time; the sample rate is the number of such values captured per second.
scalar classes the classes that represent the fundamental math building blocks of the IFC environment. These classes perform the same functions as their counterparts in standard C/C++ programming languages (int, short, long, float, double, ...), but add the ability to read and write themselves to an object-oriented file system known as Signal Object File (Sof).
scoring the process by which a recognition system's output is compared to reference transcriptions containing the "correct" answer. Errors are tabulated and presented in a format that can help a user understand the deficiencies of the system. NIST distributes a scoring package that is widely used within the community.
server computer that provides client stations with access to files and printers as shared resources to a computer network.
signal flow graph graphical representation of an input source receiving a signal, passing the signal to algorithms for processing, and producing an output with data from the signal.
signal modeling process of representing a signal based on some defined model that is useful to the system
smoothing technique, provided in the SRI language modeling tools, used during training to generate broader language models. It allows all word sequences to occur with some probability.
Sof (Signal Object File) ISIP's internal format for storing any type of C++ data. The file is essentially an indexing scheme that keeps track of the locations of all objects in the file. Files can be stored in a binary or text format.
spectrogram visual display of vocal frequencies measured over some window of time.
speech file any sound file containing human speech.
speech recognizer computer program that attempts to decode digital speech.
stack search A depth-first approach to search in speech recognition. Extremely useful for N-best hypothesis generation.
state-tying the sharing of parameters between states of different phone models that are acoustically similar. States are tied together because of the sparsity of training data; tying reduces system complexity and allows synthesis of unseen models.
substitution the type of error in which a word in the reference transcription is replaced by an incorrect word in the recognizer's hypothesis. Technically, the start and stop times of the hypothesis must overlap with the reference string for an error to be counted as a substitution.
 
T:
tar file an archive file format that is used to store many files and directories in an efficient and portable manner. The acronym tar represents tape archive file, and dates back to the use of magnetic tape. Today, tar is one of the most common mechanisms used to transfer groups of files and directories from one machine to another.
time domain characteristics of a digital signal pertaining to its change over time.
time-synchronous refers to an approach in the speech recognition search process in which all active hypotheses are extended one frame at a time as each new feature vector arrives.
training process of estimating model parameters so that they converge on a solution yielding the most likely sequence of vectors for a given acoustic unit.
transcription error an error made by a human transcriber or by an automatic (computer) transcription system.
triphone a context dependent phone model that models both the left and the right neighboring context.
 
U:
 
V:
variance splitting technique for successively splitting each mixture component in the distribution until the desired number of mixture components has been created. It preserves the variance of the distribution while shifting the mean of the distribution by some fraction of the standard deviation.
Viterbi beam search a suboptimal search algorithm based on the principle of dynamic programming in which the most promising hypotheses are maintained, and other hypotheses are discarded. The term "beam" is used because the analogy can be made with how you search around a dark room with a flashlight. The name Viterbi is used because this search approach is similar to Viterbi decoding, which is a special case of dynamic programming pioneered in communication systems. A Viterbi beam search is essentially a breadth-first suboptimal search in which only the most promising candidates are pursued. Several thresholds on the overall likelihoods of the hypotheses are applied to select the most promising candidates.
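The following is a minimal sketch in C++ of one time-synchronous step of a Viterbi beam search over an HMM trellis; it assumes log-probability transition and observation scores are already available, and all names and data structures are illustrative rather than taken from any particular recognizer.

    #include <algorithm>
    #include <limits>
    #include <vector>

    const double LOG_ZERO = -std::numeric_limits<double>::infinity();

    // Extend all active hypotheses by one frame, then prune with a beam.
    //   score[i]    : best log-probability of reaching state i at the current frame
    //   trans[i][j] : log-probability of moving from state i to state j
    //   obs[j]      : log-probability of the current feature vector in state j
    std::vector<double> viterbiBeamStep(const std::vector<double>& score,
                                        const std::vector< std::vector<double> >& trans,
                                        const std::vector<double>& obs,
                                        double beamWidth) {
      const size_t n = score.size();
      std::vector<double> next(n, LOG_ZERO);
      if (n == 0) return next;

      // dynamic programming: keep only the best predecessor for each state
      for (size_t i = 0; i < n; ++i) {
        if (score[i] == LOG_ZERO) continue;          // inactive hypothesis
        for (size_t j = 0; j < n; ++j) {
          double s = score[i] + trans[i][j] + obs[j];
          if (s > next[j]) next[j] = s;
        }
      }

      // beam pruning: discard hypotheses too far below the best one
      double best = *std::max_element(next.begin(), next.end());
      for (size_t j = 0; j < n; ++j)
        if (next[j] < best - beamWidth) next[j] = LOG_ZERO;

      return next;
    }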
 
W:
WAV Microsoft WAV format, a format used in the Microsoft Windows operating system to store audio files. Support for WAV is provided through Silicon Graphics' Audio File Library.
waveform a mathematical and visual representation of an analog wave, usually a graph obtained by plotting a characteristic of the wave against time.
window a collection of samples surrounding a frame over which the feature measurements are taken, conveying a smoother representation of the speech data.
Word Error Rate (WER) a measure of the accuracy of a speech recognition system that tabulates three types of errors: substitutions, deletions and insertions (see the formula at the end of this section). This is typically computed using a standard set of tools provided by NIST.
word-internal a triphone model that remains within word boundaries.
word model a model of an entire word, typically formed from the sequence of phonemes produced for that word.
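The word error rate referred to above is computed as

    \mathrm{WER} = \frac{S + D + I}{N} \times 100\%,

where S, D, and I are the numbers of substitution, deletion, and insertion errors in the best alignment of the hypothesis against the reference, and N is the total number of words in the reference transcription.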
 
X:
 
Y:
 
Z:
   