# Clustering acoustic parameter trajectories

### Motivation

#### Making the data obey the Markov assumption

• Mixture models assume that past and future frames are conditionally independent given the present frame.
• Speech violates this assumption.
• Speakers have different pronunciations, speaking styles, etc.
• Identifying these "modalities" and building separate models for them minimizes the violation of the Markov assumption.

#### Recovering context for syllables

• Tri-syllables are impractical.
• There are not enough training samples: 340 versus 3800.
• Context may be a modality for particular syllables.

#### Reducing confusability

• Mixture weights are trained to increase the likelihood of the training data.
• But during recognition, unseen Gaussian sequences can also receive high likelihood.

### What does the data show?

#### Spectral sequences

• The tokens were arranged by examining forced alignments of the training data, in the form:
  `<frame data> ::= {most probable Gaussian, acoustic score for that Gaussian}`
  `<state data> ::= {<frame data>+}`
  `<token data> ::= {<state data>+}`
• All tokens were reduced to the same "duration" by identifying the most common Gaussian among the frames of each state; ties are broken by picking the Gaussian with the highest score.
• Example token: `{ {{5,-78}, {6,-81}}, {{6,-83}, {6,-83}, {6,-80}}, {{0,-85}} }`
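The reduction described above can be sketched as follows. This is a minimal sketch assuming the token structure shown in the example; the function name is ours:

```python
from collections import Counter

def reduce_token(token):
    """Reduce a token (a list of states, each a list of
    (gaussian, acoustic_score) frames) to one dominant Gaussian per
    state. Ties on frame count are broken by taking the tied Gaussian
    with the highest acoustic score."""
    reduced = []
    for state in token:
        counts = Counter(g for g, _ in state)
        top = max(counts.values())
        tied = {g for g, c in counts.items() if c == top}
        # among tied Gaussians, pick the one with the best (highest) score
        best = max((score, g) for g, score in state if g in tied)[1]
        reduced.append(best)
    return reduced

# the example token from the text
token = [[(5, -78), (6, -81)], [(6, -83), (6, -83), (6, -80)], [(0, -85)]]
# first state: 5 and 6 tie on count; 5 wins on score (-78 > -81)
print(reduce_token(token))  # -> [5, 6, 0]
```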

##### Research problem

• Entropy of the trajectory distribution: 7 out of a maximum of 9 bits.
• Highly probable paths share legs.
• How should we cluster trajectories?
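Since each trajectory is a discrete sequence, the entropy figure above can be obtained with a plug-in estimate over the empirical path distribution. A minimal sketch with made-up toy paths:

```python
import math
from collections import Counter

def trajectory_entropy(trajectories):
    """Empirical entropy (in bits) of a distribution over discrete
    trajectories, each represented as a hashable tuple."""
    counts = Counter(trajectories)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# toy example: 8 tokens spread over 4 distinct paths
paths = [(5, 6, 0)] * 4 + [(6, 1, 5)] * 2 + [(1, 7, 3)] + [(4, 7, 3)]
print(trajectory_entropy(paths))  # -> 1.75
```

With N distinct paths the maximum is log2(N) bits, so an entropy close to the maximum (as in the 7-of-9 figure) indicates a broad, weakly concentrated trajectory distribution.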

### Two Solutions

#### Cluster using both spectral and temporal information

##### Data

• Forced alignments of the training data with our best hybrid system, using syllables, monosyllabic words, and triphones.
• Use frame data in the form `{state, most probable mixture}`; in the future we would like to include the acoustic vector.

##### Algorithm

• K-means cluster the data.
• "Distance" between two tokens i and j = `d[i,j] + d[j,i]`, where `d[i,j]` is obtained by linearly stretching the shorter-duration token to the length of the longer one (ideally, the stretch would be computed by dynamic programming). Any mismatch in state label incurs a penalty of 2; a mismatch of Gaussian for the same state incurs a penalty of 1.
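A minimal sketch of this distance, assuming each token is a list of `(state, gaussian)` frame pairs and using nearest-index linear stretching in place of the dynamic-programming alignment the text mentions as the ideal (function names are ours):

```python
def stretch(seq, length):
    """Linearly stretch a sequence to the given length by
    nearest-index frame assignment (length >= len(seq))."""
    return [seq[i * len(seq) // length] for i in range(length)]

def directed_distance(a, b):
    """d[a,b]: stretch the shorter token to the longer one, then sum
    frame-by-frame penalties: 2 for a state-label mismatch, 1 for a
    Gaussian mismatch within the same state."""
    n = max(len(a), len(b))
    a, b = stretch(a, n), stretch(b, n)
    cost = 0
    for (sa, ga), (sb, gb) in zip(a, b):
        if sa != sb:
            cost += 2
        elif ga != gb:
            cost += 1
    return cost

def token_distance(a, b):
    # symmetric distance used for k-means: d[i,j] + d[j,i]
    return directed_distance(a, b) + directed_distance(b, a)

a = [(0, 5), (0, 6), (1, 6)]
b = [(0, 5), (1, 2), (1, 6)]
print(token_distance(a, b))  # one state mismatch each way -> 4
```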

##### 1st cluster: probability of occurrence in tokens = 0.19, duration = 20 frames.

{1,4} -> {7} -> {7,2,5} -> {1,8,6} -> {7,7,4,4,3,3,3,3} -> {3,3,5}

##### 2nd cluster: probability = 0.27, duration = 12 frames.

{6,6,2,5,8} -> {1,1} -> {5,5} -> {2} -> {5,5}

##### HMM topology used for recognition

• Multipath HMM.
• First attempt, for each path: a finite-duration model with skip states.

##### Results

• Unavailable because of problems with HTK.

#### Cluster by identifying common legs

• A given syllable has n states.
• Each token therefore has n−1 transitions ("legs").
• Use the dominant mixture sequence to minimize alignment problems.
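One way to surface shared legs, assuming each token has already been reduced to its dominant-Gaussian sequence; this is a sketch with hypothetical helper names, not the system's actual implementation:

```python
from collections import Counter

def legs(reduced_token):
    """The n-1 legs of a reduced token: consecutive pairs of dominant
    Gaussians, keyed by transition position."""
    return [(k, reduced_token[k], reduced_token[k + 1])
            for k in range(len(reduced_token) - 1)]

def common_legs(tokens, min_count=2):
    """Count each leg across all tokens and keep those occurring in
    at least min_count tokens."""
    counts = Counter(leg for t in tokens for leg in legs(t))
    return {leg: c for leg, c in counts.items() if c >= min_count}

tokens = [(5, 6, 0), (5, 6, 1), (2, 6, 0)]
print(common_legs(tokens))  # -> {(0, 5, 6): 2, (1, 6, 0): 2}
```

Tokens could then be clustered by grouping those that share their most frequent legs.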

##### Graphical models

• There are two extremes the data could exhibit: one modality shared by all tokens, or no modalities at all.
• Either extreme leads to a good HMM model.
• Real data probably contains both; we need to tease them apart.
• Conditional independence accommodates both extremes.
• Over the dominant Gaussian sequence `{g_1, g_2, ..., g_n}`, test pairwise conditional independence: `Information[i ⊥ j | K \ {i, j}]`.
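The conditional independence quantity above can be estimated empirically as a plug-in conditional mutual information, conditioning on the dominant Gaussians at all remaining positions. A minimal sketch under that assumption (names are ours; with real data the conditioning contexts are sparse, so smoothing would be needed):

```python
import math
from collections import Counter

def cond_mutual_info(tokens, i, j):
    """Plug-in estimate, in bits, of I(g_i ; g_j | K \\ {i, j}),
    where each token is a tuple of dominant Gaussians and the
    condition is the tuple of Gaussians at all other positions."""
    n = len(tokens)
    ctx = lambda t: tuple(g for k, g in enumerate(t) if k not in (i, j))
    p_xyz = Counter((t[i], t[j], ctx(t)) for t in tokens)
    p_xz = Counter((t[i], ctx(t)) for t in tokens)
    p_yz = Counter((t[j], ctx(t)) for t in tokens)
    p_z = Counter(ctx(t) for t in tokens)
    mi = 0.0
    for (x, y, z), c in p_xyz.items():
        # counts substitute for probabilities: the 1/n factors cancel
        mi += (c / n) * math.log2(c * p_z[z] / (p_xz[(x, z)] * p_yz[(y, z)]))
    return mi

# independent case: positions 0 and 2 vary freely -> ~0 bits
ind = [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 0, 1)]
print(cond_mutual_info(ind, 0, 2))
# dependent case: position 2 copies position 0 -> 1 bit
dep = [(0, 0, 0), (1, 0, 1), (0, 0, 0), (1, 0, 1)]
print(cond_mutual_info(dep, 0, 2))
```

A value near zero for a pair (i, j) supports merging those positions' tokens into one modality; a large value signals distinct modalities to model separately.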