WORDS: OBSERVABLE UNITS OF A LANGUAGE?
- Loosely defined as a lexical unit - there is an agreed upon
meaning in a given community.
- In many languages (e.g., Indo-European), easily observed in the
orthographic (writing) system since it is separated by white space.
- In spoken language, however, there is a segmentation problem: words
run together.
- Syntax: certain facts about word structure and combinatorial
possibilities are evident to most native speakers.
- Paradigmatic: properties related to meaning.
- Syntagmatic: properties related to constraints imposed
by word combinations (grammar).
- Word-level constraints are the most common form of "domain knowledge"
in a speech recognition system.
- N-gram models are the most common way to implement
word-level constraints.
- N-gram
distributions are very interesting!