2.4.2 Auxiliary Resources: Audio and Transcription Databases

One of the more time-consuming aspects of speech recognition research is preparation and coordination of speech audio data and speech transcriptions. Often, experiments are aborted because the list of audio files does not match the list of transcriptions. Unless these two are tied together in some way, it is difficult to avoid such problems. Therefore, in our system, we provide a unique method for storing and accessing speech data and transcriptions through two related database representations, AudioDatabase and TranscriptionDatabase. These databases are created and manipulated using a single tool called isip_make_db.


AudioDatabase

Storage and access to speech data files is managed through an internally defined database format, AudioDatabase. This database manages a set of records. A record typically contains 1) a unique identifier, which we refer to as the id, and 2) the location of the speech file on disk. To obtain a record from the audio database, the id must be referenced.

Consider a collection of three files: ae_12a.sof, ae_1a.sof, and ae_2789385a.sof. We need to arrange these in a single file, called a list file, with corresponding ids. An example of such a file is audio_list.text.

Go to the directory:
    $ISIP_TUTORIAL/sections/s02/s02_04_p02/
We can convert this list file to an audio database using isip_make_db:

    isip_make_db -db audio -audio audio_list.text -name TIDigits -type text audio_db.sof
The first option, "-db", indicates the type of database you want. Currently available choices are "audio", "transcription" (which is described below), and "both". In this case, we selected "audio" since we want an audio database.

The second option, "-audio", provides the name of the listing file. This listing file typically contains a filename followed by a key. You can create these fairly easily using Unix commands such as "ls" and a programmable editor such as "emacs". The key is optional, in which case a unique key will be generated automatically. An example of a listing file is audio_list.text. This file contains the three filenames mentioned above and the corresponding ids (based on the file's basename in this example).
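A listing file of this kind can also be generated with a short script instead of "ls" and an editor. The following is a minimal sketch, assuming one entry per line with the filename and the id separated by a single space; the helper name make_listing is hypothetical, and the exact layout should be checked against audio_list.text.

```python
import os

def make_listing(filenames):
    """Return listing lines of the form '<filename> <id>'.

    Assumption: the id is the file's basename without its extension,
    and the two fields are separated by a single space.
    """
    lines = []
    seen = set()
    for path in sorted(filenames):
        key = os.path.splitext(os.path.basename(path))[0]
        if key in seen:
            # ids must be unique, so refuse to emit a duplicate
            raise ValueError(f"duplicate id: {key}")
        seen.add(key)
        lines.append(f"{path} {key}")
    return lines

listing = make_listing(["ae_12a.sof", "ae_1a.sof", "ae_2789385a.sof"])
print("\n".join(listing))
```

Rejecting duplicate ids at this stage keeps the uniqueness requirement visible before the list is handed to isip_make_db.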

The third option, "-name", should be set to the name of the data. The fourth option, "-type", is used to generate either a text or binary Sof file. In this case we use "text" so we can view the output file by simply listing it. The last entry, which is the first non-option argument, is the name of the output file which will contain an audio database. See audio_db.sof for the output from the example given above.

The database file contains four Sof objects: an AudioDatabase object, and three Filename objects which contain the names of the files included in this example. The AudioDatabase object encapsulates the database name (e.g., TIDigits), a list of ids, and a mapping from ids to Filename object numbers. The ids link filenames to the transcriptions described below.
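The relationship between these objects can be pictured with a small in-memory model. This is only a conceptual sketch: the class AudioDbSketch and its fields are hypothetical, it does not read or write the Sof format, and numbering the Filename objects from 0 is an assumption made for the illustration.

```python
from dataclasses import dataclass, field

@dataclass
class AudioDbSketch:
    """Toy stand-in for the contents of an AudioDatabase (not the Sof format)."""
    name: str
    ids: list = field(default_factory=list)
    id_to_object: dict = field(default_factory=dict)  # id -> Filename object number
    filenames: list = field(default_factory=list)     # Filename objects, by number

    def add(self, key, filename):
        if key in self.id_to_object:
            raise KeyError(f"id already present: {key}")
        # Assumption: object numbers start at 0 and follow insertion order.
        self.id_to_object[key] = len(self.filenames)
        self.filenames.append(filename)
        self.ids.append(key)

    def lookup(self, key):
        """Return the filename stored for an id."""
        return self.filenames[self.id_to_object[key]]

db = AudioDbSketch(name="TIDigits")
for fname in ["ae_12a.sof", "ae_1a.sof", "ae_2789385a.sof"]:
    db.add(fname.rsplit(".", 1)[0], fname)
print(db.lookup("ae_1a"))
```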

Since the audio files are often stored somewhere other than the current working directory, it is useful to make these databases using filenames that work from any directory. The obvious way to do this is to use a fully qualified filename. For example, "ae_12a.sof" could be represented as "/isi./data/corpora/tidigits/ae_12a.sof". Another convenient way to do this is to use an environment variable. For example, the file named "ae_12a.sof" can be represented as "$TUTORIAL/ae_12a.sof" in the file audio_list.text. If the environment variable "$TUTORIAL" is properly set to "/isi./data/corpora/tidigits", then this file will be accessible from any location. The advantage of an environment variable is that the database can be moved to a new location and the only thing that needs to be updated is the environment variable.
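The effect of the environment variable can be illustrated outside the toolkit. The sketch below uses Python's standard os.path.expandvars to resolve such an entry; the two directory paths are hypothetical, and the example shows only the general idea, not how the ISIP tools expand variables internally.

```python
import os

# Hypothetical corpus location, used only for this illustration.
os.environ["TUTORIAL"] = "/data/corpora/tidigits"

entry = "$TUTORIAL/ae_12a.sof"
resolved = os.path.expandvars(entry)
print(resolved)

# Moving the corpus only requires changing the variable, not the list entry.
os.environ["TUTORIAL"] = "/mnt/archive/tidigits"
moved = os.path.expandvars(entry)
print(moved)
```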


TranscriptionDatabase

Transcriptions for the speech files in an audio database are managed by a TranscriptionDatabase. This database uses annotation graphs to represent the transcriptions, which typically consist of strings of words (though they can be much more complicated than that). The transcriptions are organized using the same key value used in the audio database. To obtain a transcription of a particular speech file in an audio database, the key for that particular data file must be referenced.

Continuing the example described above, we can create a transcription list file in many different ways using standard Unix commands and editors. For applications such as TIDigits, this is particularly simple because the transcriptions are encoded in the filename. An example of a transcription list file is provided in trans_list.text. This file contains fields of the form:

    key [start_time] [stop_time] [channel]: ... transcription ...
The key should match the corresponding audio file described above. The start and stop times are optional, and denote where the speech data begins and ends in the corresponding audio file. The channel index is used in the event that the audio file contains multiple channels (e.g., stereo). The field after ":" contains the desired transcription of the utterance.
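To make the field layout concrete, here is a minimal parsing sketch. It assumes that the fields before ":" are separated by whitespace, that the times are floating-point numbers and the channel is an integer, and that the start and stop times appear together. The function parse_trans_line and the sample lines are hypothetical and are not taken from trans_list.text.

```python
def parse_trans_line(line):
    """Parse 'key [start_time] [stop_time] [channel]: transcription'.

    Assumptions: fields before ':' are whitespace-separated, times are
    floating-point numbers, and the channel is an integer.
    """
    head, sep, text = line.partition(":")
    if not sep:
        raise ValueError("missing ':' separator")
    fields = head.split()
    # Accept: key only, key + times, or key + times + channel.
    if len(fields) not in (1, 3, 4):
        raise ValueError(f"unexpected field count: {len(fields)}")
    record = {"key": fields[0], "start": None, "stop": None,
              "channel": None, "words": text.split()}
    if len(fields) >= 3:
        record["start"], record["stop"] = float(fields[1]), float(fields[2])
    if len(fields) == 4:
        record["channel"] = int(fields[3])
    return record

# Hypothetical sample lines, for illustration only.
rec = parse_trans_line("ae_12a 0.25 1.10 0: ONE TWO")
bare = parse_trans_line("ae_1a: ONE")
print(rec)
print(bare)
```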

The command to create a transcription database file from this data is:

    isip_make_db -db transcription -trans trans_list.text -level word -name TIDigits -type text trans_db.sof
We have introduced two new options here: (1) "-trans" provides the name of the transcription list file, and (2) "-level" assigns a tag to this transcription. The level tag will be discussed later when we introduce acoustic training (see Section 5) and recognition scoring (see scoring in Section 4).

The resulting transcription database can be viewed in trans_db.sof. This file contains a TranscriptionDatabase object and three AnnotationGraph objects. The latter contain the actual transcriptions along with the timing information. The former contains the ids used to reference individual AnnotationGraphs. The format of this object is the same as described above for audio_db.sof.

Note that both of these databases could have been built using a single command:

    isip_make_db -db both -audio audio_list.text -trans trans_list.text \
        -level word -name TIDigits -type text audio_db.sof trans_db.sof
This is the preferred way to run the command since it makes clear that the key is the bridge between the two types of information.

The beauty of our database approach to handling file lists is that important subsets of a database are now simply referenced using lists of ids. In this way, we avoid the problem of mismatches between audio files and transcriptions. The audio and transcription databases are created once for the entire database, and users simply need to operate on the appropriate lists of ids. Common problems such as a missing transcription or an incorrect ordering of files, which cause mismatches between simple listing files, are alleviated because there is just one file, a list of ids, that needs to be maintained.
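The idea of working from id lists can be sketched with two plain dictionaries standing in for the databases. The functions check_ids and select are hypothetical helpers, the transcription strings are placeholder values, and nothing here uses the actual database classes.

```python
def check_ids(audio_ids, trans_ids):
    """Return the ids that appear in only one of the two collections."""
    audio, trans = set(audio_ids), set(trans_ids)
    return sorted(audio - trans), sorted(trans - audio)

def select(id_list, audio, trans):
    """Pair the audio file and transcription for each id, in list order."""
    return [(key, audio[key], trans[key]) for key in id_list]

# Stand-ins for the two databases, both keyed by id (placeholder values).
audio = {"ae_12a": "ae_12a.sof", "ae_1a": "ae_1a.sof",
         "ae_2789385a": "ae_2789385a.sof"}
trans = {"ae_12a": "transcription A", "ae_1a": "transcription B",
         "ae_2789385a": "transcription C"}

missing_trans, missing_audio = check_ids(audio, trans)
subset = select(["ae_1a", "ae_12a"], audio, trans)
print(missing_trans, missing_audio)
print(subset)
```

Because both lookups go through the same id, the order of the id list determines the order of the subset, and an id absent from either side is reported rather than silently misaligned.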

For a more detailed explanation of isip_make_db, see our on-line documentation.
   