Clustering acoustic parameter trajectories

Motivation

Making the data obey the Markov assumption

  • Mixture models assume that the past and future frames are conditionally independent given the present frame.
  • Speech violates this assumption.
  • Speakers have different pronunciations, speaking styles, etc.
  • Identifying these "modalities" and building separate models for them minimizes the violation of the Markov assumption.

    Recovering context for syllables

  • Tri-syllables are impractical.
  • There are not enough training samples: 40³ versus 800³.
  • Context may be a modality for particular syllables.

    Reducing confusability

  • Mixture weights are meant to increase the likelihood of the training data.
  • But during recognition, unobserved gaussian sequences also get a high likelihood.

    What does the data show?

    Spectral sequences

  • The tokens were assembled from forced alignments of the training data in the form:
    <frame data> ::=
    {most probable gaussian, acoustic score for that gaussian}
    <state data> ::= {<frame data>+}
    <token data> ::= {<state data>+}
  • All tokens were reduced to the same "duration" by identifying the most common gaussian in the frames of each state; ties are broken by picking the gaussian with the highest score.
  • { {{5,-78}, {6,-81}}, {{6,-83}, {6,-83}, {6,-80}}, {{0,-85}}}
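The duration-reduction step above can be sketched as follows. This is a minimal sketch: the nested-list layout mirrors the <token data> grammar, but the exact data structures are assumptions.

```python
from collections import Counter

def reduce_token(token):
    """Collapse each state of a token to its most common gaussian.

    `token` is a list of states; each state is a list of
    (gaussian_index, acoustic_score) frame pairs. Ties on frequency
    are broken by the highest acoustic score seen for that gaussian.
    """
    reduced = []
    for state in token:
        counts = Counter(g for g, _ in state)
        best_score = {}
        for g, s in state:
            best_score[g] = max(best_score.get(g, s), s)
        # Rank by (frequency, best score) and keep the winner.
        winner = max(counts, key=lambda g: (counts[g], best_score[g]))
        reduced.append(winner)
    return reduced

token = [[(5, -78), (6, -81)], [(6, -83), (6, -83), (6, -80)], [(0, -85)]]
print(reduce_token(token))  # [5, 6, 0]
```

In the first state, gaussians 5 and 6 are tied with one frame each, so the score tie-break picks 5 (-78 > -81).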

    Trajectories by gaussian index
    Trajectories by gaussian best score
    Research problem

  • Entropy of trajectory distribution = 7 out of 9 bits.
  • Highly probable paths share legs.
  • How should we cluster trajectories?
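The entropy figure above is the Shannon entropy of the empirical distribution over distinct trajectories (9 bits would be a uniform distribution over 2⁹ = 512 trajectories). A minimal sketch; the counts below are illustrative, not from the data:

```python
from math import log2

def entropy_bits(counts):
    """Shannon entropy (bits) of an empirical distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Hypothetical counts over distinct reduced trajectories (illustrative only):
print(entropy_bits([40, 30, 20, 10]))
```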

    Two Solutions

    Cluster using both spectral and temporal information

    Data

  • Forced alignments of training data with our best hybrid system using syllables, monosyllabic words and triphones.
  • Use frame data in the form {state, most probable mixture}; in the future we would like to include the acoustic vector.

    Algorithm

  • K-means cluster the data.
  • "Distance" between two tokens i and j = d[i,j] + d[j,i], where d[i,j] is obtained by linearly stretching the shorter-duration token onto the longer one (ideally, the stretch would be computed by dynamic programming). Any mismatch in state label incurs a penalty of 2; a mismatch of gaussian for the same state incurs a penalty of 1.
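The token distance can be sketched as follows. One reading of the linear stretch is a nearest-index resampling of one token onto the other's time axis; resampling in both directions makes d[i,j] + d[j,i] symmetric. The frame layout and the resampling rule are assumptions:

```python
def directed_distance(a, b):
    """d[a,b]: resample token `a` onto `b`'s frames and compare frame-wise.

    Each token is a list of (state, gaussian) frame pairs. A state
    mismatch costs 2; a gaussian mismatch within the same state costs 1
    (penalties from the text; the linear stretch stands in for the
    dynamic-programming alignment the text suggests).
    """
    cost = 0
    for j, (state_b, gauss_b) in enumerate(b):
        i = j * len(a) // len(b)   # nearest-index linear stretch
        state_a, gauss_a = a[i]
        if state_a != state_b:
            cost += 2
        elif gauss_a != gauss_b:
            cost += 1
    return cost

def token_distance(a, b):
    """Symmetric "distance" d[i,j] + d[j,i] used for k-means."""
    return directed_distance(a, b) + directed_distance(b, a)

a = [(0, 5), (0, 6), (1, 6)]
b = [(0, 5), (1, 2)]
print(token_distance(a, b))  # 4
```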

    Two out of the four clusters for "__they"
    1st cluster: probability of occurrence in tokens = 0.19, duration = 20 frames.

    {1,4} -> {7} -> {7,2,5} -> {1,8,6} -> {7,7,4,4,3,3,3,3} -> {3,3,5}

    2nd cluster: probability = 0.27, duration = 12 frames.

    {6,6,2,5,8} -> {1,1} -> {5,5} -> {2} -> {5,5}

    Histogram of durations
    HMM topology used for recognition

  • Multipath HMM
  • 1st try for each path: finite duration with skip states.

    Results

  • Unavailable because of problems with HTK.

    Cluster by identifying common legs

  • A given syllable has n states.
  • Each token has n-1 transitions.
  • Use the dominant mixture sequence to minimize alignment problems.
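Counting shared legs from the dominant mixture sequences might look like the sketch below. Here a "leg" is taken to be the transition between the dominant gaussians of two successive states, which is an assumption about the intended definition:

```python
from collections import Counter

def count_legs(reduced_tokens):
    """Count transitions ("legs") between dominant gaussians of
    successive states across all tokens of a syllable.

    Each token is a dominant-gaussian sequence [g_1, ..., g_n],
    one gaussian per state, giving n-1 legs per token. Keys are
    (state_index, gaussian_from, gaussian_to).
    """
    legs = Counter()
    for seq in reduced_tokens:
        for k in range(len(seq) - 1):
            legs[(k, seq[k], seq[k + 1])] += 1
    return legs

# Three illustrative tokens of a 3-state syllable:
tokens = [[5, 6, 0], [5, 6, 1], [2, 6, 0]]
legs = count_legs(tokens)
print(legs.most_common(2))  # the two most frequently shared legs
```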

    __a
    __or
    Graphical models

  • There are two extremes the data could exhibit: one modality for all tokens, or no modalities at all.
  • Either extreme leads to a good HMM model.
  • Real data probably has both. We need to tease them out.
  • Conditional independence accommodates both extremes.
  • Once Z is known, observing Y is irrelevant for predicting X.
  • Create an independence graph for the random vectors {g1, g2, ..., gn}. That is, calculate quantities of the form Information[i ⊥ j | K \ {i, j}] for each edge of the graph. Concentrate on high-valued edges.
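The edge quantities can be estimated with a plug-in conditional mutual information over discrete samples. A minimal sketch; the variable layout and the estimator are assumptions, and a plug-in estimate needs far more data than this toy:

```python
from collections import Counter
from math import log2

def conditional_mi(samples, i, j):
    """Plug-in estimate of I(g_i ; g_j | rest) in bits.

    `samples` is a list of equal-length tuples, one value per random
    variable; the conditioning set is every variable other than i and j.
    """
    n = len(samples)
    joint, rest_c, xi_c, xj_c = Counter(), Counter(), Counter(), Counter()
    for s in samples:
        rest = tuple(v for k, v in enumerate(s) if k not in (i, j))
        joint[(rest, s[i], s[j])] += 1
        rest_c[rest] += 1
        xi_c[(rest, s[i])] += 1
        xj_c[(rest, s[j])] += 1
    # I(X;Y|Z) = sum p(x,y,z) log [ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]
    return sum(
        (c / n) * log2(c * rest_c[rest] / (xi_c[(rest, a)] * xj_c[(rest, b)]))
        for (rest, a, b), c in joint.items()
    )

# g0 and g1 are copies of each other, independent of g2:
dependent = [(0, 0, 0), (1, 1, 0)] * 2
print(conditional_mi(dependent, 0, 1))  # 1.0 bit: a high-valued edge
```

A low value on an edge suggests the pair is conditionally independent given the rest, so the edge can be dropped from the graph.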

    Results

  • Not implemented for lack of time.