Clustering acoustic parameter trajectories

Motivation

Making the data obey the Markov assumption

Mixture-based HMMs assume that past and future frames are conditionally independent given the present frame.

Speech violates this assumption.

Speakers have different pronunciations, speaking styles, etc.

Identifying these "modalities" and building separate models for them minimizes the violation of the Markov assumption.

Recovering context for syllables

Tri-syllables are impractical.

There are not enough training samples: $40^3$ versus $800^3$.

Context may be a modality for particular syllables.

Reducing confusability

Mixture weights are meant to increase the likelihood of the training data.

But during recognition, unobserved gaussian sequences also get a high likelihood.

What does the data show?

Spectral sequences

The tokens were arranged by looking at forced alignment of training data in the form:
<frame data> ::= {most probable gaussian, acoustic score for that gaussian}
<state data> ::= {<frame data>+}
<token data> ::= {<state data>+}

All tokens were reduced to the same "duration" by keeping, for each state, only the most common gaussian among the frames of that state. Ties are broken by picking the gaussian with the highest score.

{ {{5,-78}, {6,-81}}, {{6,-83}, {6,-83}, {6,-80}}, {{0,-85}}}
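The reduction described above can be sketched in a few lines of Python. This is a minimal illustration, not the original tooling: the function names `dominant_gaussian` and `reduce_token` are hypothetical, and the tie-break is assumed to compare the best single-frame score of each tied gaussian.

```python
from collections import Counter

def dominant_gaussian(frames):
    """Return the most common gaussian index among a state's frames.

    frames: list of (gaussian_index, acoustic_score) pairs.
    Assumption: ties on count are broken by the highest single-frame score.
    """
    counts = Counter(g for g, _ in frames)
    best_score = {}
    for g, score in frames:
        best_score[g] = max(score, best_score.get(g, float("-inf")))
    return max(counts, key=lambda g: (counts[g], best_score[g]))

def reduce_token(token):
    """Collapse each state of a token to its dominant gaussian."""
    return [dominant_gaussian(state) for state in token]

# The example token from the slide: three states with 2, 3 and 1 frames.
token = [[(5, -78), (6, -81)], [(6, -83), (6, -83), (6, -80)], [(0, -85)]]
print(reduce_token(token))  # [5, 6, 0]
```

In the first state, gaussians 5 and 6 each occur once, so the tie goes to gaussian 5 because its score (-78) is higher than -81.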

Trajectories by gaussian index

Trajectories by gaussian best score

Research problem

Entropy of trajectory distribution = 7 out of 9 bits.
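The entropy figure can be reproduced from a list of reduced trajectories with a plug-in estimate. The sketch below is an assumption about how such a number could be computed, not the procedure actually used; `trajectory_entropy_bits` and the toy data are hypothetical. We read the "9 bits" as the maximum possible entropy, which the source does not state explicitly.

```python
import math
from collections import Counter

def trajectory_entropy_bits(trajectories):
    """Plug-in entropy (in bits) of the empirical trajectory distribution."""
    counts = Counter(tuple(t) for t in trajectories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy data: one trajectory seen twice, two seen once each.
toy = [(5, 6, 0), (5, 6, 0), (5, 6, 1), (2, 6, 0)]
print(trajectory_entropy_bits(toy))  # 1.5
```

An entropy close to the maximum means the trajectories are spread almost uniformly, so no single path dominates.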

Highly probable paths share legs.

How should we cluster trajectories?

Two Solutions

Cluster using both spectral and temporal information

Data

Forced alignments of training data with our best hybrid system using syllables, monosyllabic words and triphones.

Use frame data in the form {state, most probable mixture}; in the future we would like to include the acoustic vector.

Algorithm

Cluster the data with k-means.

"Distance" between two tokens i and j = d[i,j] + d[j,i], where d[i,j] is obtained by linearly stretching the shorter token to the duration of the longer one (ideally, the stretch would be computed with dynamic programming). A mismatch in state label incurs a penalty of 2; a mismatch in gaussian for the same state incurs a penalty of 1.

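One possible reading of the token "distance" is sketched below. The source does not spell out how d[i,j] differs from d[j,i], so this sketch assumes that d[i,j] maps every frame of token i onto token j by linear time scaling and sums the penalties. All function names are hypothetical.

```python
def frame_penalty(a, b):
    """Penalty between two frames given as (state, gaussian) pairs."""
    if a[0] != b[0]:
        return 2  # state labels differ
    if a[1] != b[1]:
        return 1  # same state, different gaussian
    return 0

def directed_distance(src, dst):
    """Assumed d[i,j]: map each frame of src linearly onto dst, sum penalties."""
    n, m = len(src), len(dst)
    total = 0
    for t, frame in enumerate(src):
        u = min(m - 1, (t * m) // n)  # linearly scaled frame index in dst
        total += frame_penalty(frame, dst[u])
    return total

def token_distance(a, b):
    """Symmetric "distance" d[i,j] + d[j,i] between two tokens."""
    return directed_distance(a, b) + directed_distance(b, a)

a = [(1, 4), (1, 4), (2, 7)]
b = [(1, 4), (2, 7)]
c = [(1, 5), (2, 7)]
print(token_distance(a, b))  # 2
print(token_distance(b, c))  # 2
```

Replacing the linear index mapping in `directed_distance` with a dynamic-programming alignment would give the stretch the slide calls ideal.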
Two out of the four clusters for "__they"

1st cluster: probability of occurrence in tokens = 0.19, duration = 20 frames.

{1,4} -> {7} -> {7,2,5} -> {1,8,6} -> {7,7,4,4,3,3,3,3} -> {3,3,5}

2nd cluster: probability = 0.27, duration = 12 frames.

{6,6,2,5,8} -> {1,1} -> {5,5} -> {2} -> {5,5}

Histogram of durations

HMM topology used for recognition

Multipath HMM

1st try for each path: finite duration with skip states.

Results

Unavailable because of problems with HTK.

Cluster by identifying common legs

A given syllable has n states.

Each token has n-1 transitions.

Use the dominant mixture sequence to minimize alignment problems.
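A minimal sketch of how common legs could be counted, assuming each token has already been reduced to its dominant mixture sequence (one index per state). The names `legs` and `leg_counts` and the toy tokens are hypothetical.

```python
from collections import Counter

def legs(dominant_sequence):
    """Return the n-1 transitions of an n-state token as (position, from, to)."""
    pairs = zip(dominant_sequence, dominant_sequence[1:])
    return [(k, a, b) for k, (a, b) in enumerate(pairs)]

def leg_counts(tokens):
    """Count how often each leg occurs across all tokens."""
    counts = Counter()
    for seq in tokens:
        counts.update(legs(seq))
    return counts

tokens = [[5, 6, 0], [5, 6, 1], [2, 6, 0]]
counts = leg_counts(tokens)
print(counts.most_common(2))
```

Here the legs 5 -> 6 (between states 1 and 2) and 6 -> 0 (between states 2 and 3) each occur twice, so tokens sharing them are candidates for the same cluster.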

__a

__or

Graphical models

The data could sit at one of two extremes: a single modality for all tokens, or no modalities at all.

Either extreme leads to a good HMM.

Real data probably has both. Need to tease them out.

Conditional independence accommodates both extremes.

Knowing Z, reading Y is irrelevant for reading X.

Create an independence graph for the random vectors $\{g_1, g_2, \ldots, g_n\}$. That is, calculate quantities of the form

$\mathrm{Information}[\, i \perp j \mid K \setminus \{i, j\} \,]$

for each edge of the graph. Concentrate on high-valued edges.
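If the edge quantity is read as the conditional mutual information between g_i and g_j given all remaining variables (our interpretation of the notation), a plug-in estimate from discrete samples could look like the sketch below. The function name and the toy samples are hypothetical, and with real data the counts would be far too sparse for large n without smoothing.

```python
import itertools
import math
from collections import Counter

def conditional_information_bits(samples, i, j):
    """Plug-in estimate of I(g_i ; g_j | all other g_k) in bits.

    samples: list of equal-length tuples of discrete values,
    e.g. one dominant gaussian index per state.
    """
    def rest(x):
        return tuple(v for k, v in enumerate(x) if k not in (i, j))

    n = len(samples)
    joint = Counter((x[i], x[j], rest(x)) for x in samples)
    xi_z = Counter((x[i], rest(x)) for x in samples)
    xj_z = Counter((x[j], rest(x)) for x in samples)
    z = Counter(rest(x) for x in samples)

    total = 0.0
    for (a, b, r), c in joint.items():
        # p(a,b,r) * log2[ p(a,b,r) p(r) / (p(a,r) p(b,r)) ], written with counts
        total += (c / n) * math.log2(c * z[r] / (xi_z[(a, r)] * xj_z[(b, r)]))
    return total

# g_1 and g_2 always agree: the edge carries one full bit.
dependent = [(0, 0, 0), (1, 1, 0)]
# All combinations equally likely: every pair is conditionally independent.
independent = list(itertools.product([0, 1], repeat=3))
print(conditional_information_bits(dependent, 0, 1))    # 1.0
print(conditional_information_bits(independent, 0, 1))  # 0.0
```

Edges whose estimate is near zero would be dropped from the independence graph; the high-valued ones are the dependencies worth modelling.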

Results

Not implemented due to lack of time.