EEM.ssr: Lab 2 - feature analysis and model initialisation

Assessment: students should work individually and submit their files via SurreyLearn to Dr Jackson before 4pm on Tuesday 18 March 2014

Required software: HTK (v.3.4.1) and Matlab (v.7) are the recommended applications for this laboratory

Additional software: Praat (v.5.1 or higher) and SFS (v.4.7) for speech analysis, annotation and visualisation

Aims of the Experiment

To gain experience of analysing features extracted from your recorded speech signals using Praat and the Hidden Markov Model Toolkit (HTK). To familiarise oneself with numerical manipulation using Matlab in order to obtain some initial parameter values for HMM word templates.

1. Preparation

In preparation for the lab, you should:

Make sure you have the speech files, word label files and MFCC files that you created in the previous practical.
Understand the meaning of the mean and covariance in the context of your 13-dimensional MFCC feature vectors.
Download a copy of the HTK book (for which you will need to register on their website), and read chapter 1. (You can skim parts of the mathematics in sections 1.4 and 1.5, which we will be covering later in the course, although you may find equations 1.10-1.12 helpful.)
Familiarise yourself with basic operations in Matlab, including functions: load, length, size, disp, sprintf, reshape, zeros, linspace, interp1, round, mean, cov, plot, imagesc, axis, xlabel, ylabel and print.

Here are some useful functions, examples and resources that you can find in the lab2 directory:

frq2mel.m, mel2frq.m - for converting between linear frequency (Hz) and perceived frequency or pitch (Mel)
readmfcc.m, mfccgram.m - for reading HTK format files and plotting MFCCs in a spectrogram-like display
readhmm.m, writehmm.m - for reading and writing HMMs in HTK format
config_WAV_MFCC_0 - an HTK configuration file for extracting MFCCs from your 16 kHz speech files (cf. Section 3.1.5, p.29 in HTK book: example configuration file)
digit-stats_SURNAME.xls - a spreadsheet template for filling in your mean and covariance values
proto1_39.5 - an example of an HMM prototype in HTK format, initialised with zero mean and unit variance (cf. Section 3.2.1, p.31 in HTK book: prototype HMM)

2. Overview

An overview of the tasks for this experiment is as follows:

Use HTK to extract MFCC features automatically from the set of digits you recorded in the previous lab.
Read and convert the MFCCs into a spectrographic display.
Calculate essential statistics from your digit features.
Initialise HMM prototypes for each spoken digit.

A summary of the deliverables that are to be submitted is as follows:

Using HTK to extract features

MFCC feature files for your 11 digit recordings in HTK format (MFCC_0)

Plotting MFCC-based spectrograms

A plot of a spectrogram derived from MFCCs for the word "three" including word boundaries and axis labels, saved in a compact portable image format (e.g., JPEG, PNG or PDF)

Calculating statistics

An Excel spreadsheet containing the mean and covariance for the word "eight" on one sheet, and the same for silence on a second sheet (in 97/2000/XP format)

Initialising HMM templates

One text file containing your twelve HMM prototype definitions (eleven digits plus silence), in HTK format

The deadline for submission of your files to Dr Jackson via SurreyLearn is 4pm on Tuesday 18 March 2014.

3. Experimental Work

3.1 Using HTK to extract features

Run HTK's HCopy over each of your 11 digit recordings using the configuration file provided for extracting 13 MFCCs including c₀ (i.e., from 0 to 12). For example,
HCopy -C config_WAV_MFCC_0 input.wav output.mfcc
You can do this either manually for each file, or by providing a script file (-S option) that lists your input WAV files and the corresponding output file names. Note that if you type just the command name, with no arguments, in a terminal window, it will return information about the tool's usage.
Use HList to view the details in the file header and the MFCCs for the first few frames of data, e.g.:
HList -h -e 2 output.mfcc
Fire up Matlab and use the function readmfcc to import the data from the HTK files. It requires you to specify its input file name, e.g.:
>> clear all; ifn='output.mfcc'; readmfcc
Check that the values it has read in correspond to those reported by HList.
Write a short Matlab script to read in HTK's MFCC features from one speech file and plot out the derived spectrogram using mfccgram, e.g.:
>> mfccgram(Y,0)
Make sure you have downloaded the functions mel2frq and frq2mel which it will need to warp and unwarp the frequency (between linear frequency and perceived frequency). I suggest you use the features from the word "three". An example is given at the bottom of Figure 1.
You need to submit MFCC feature files for your 11 digit recordings in HTK format (MFCC_0).
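As a rough illustration of the script-file route, the sketch below builds the list that HCopy's -S option reads, with one "source target" pair per line. The WAV and MFCC file names and the script name files.scp are hypothetical; substitute the names of your own recordings.

```matlab
% Sketch: write an HCopy script file listing source and target files.
% All file names here are hypothetical placeholders for your own recordings.
words = {'zero','oh','one','two','three','four','five','six','seven','eight','nine'};
lines = cell(size(words));
for k = 1:numel(words)
    % one "input output" pair per line, as HCopy expects with -S
    lines{k} = sprintf('%s.wav %s.mfcc', words{k}, words{k});
end
scp = 'files.scp';                  % hypothetical script file name
fid = fopen(scp, 'w');
if fid < 0, error('Cannot open %s for writing', scp); end
fprintf(fid, '%s\n', lines{:});     % writes the 11 lines
fclose(fid);
```

The resulting file could then be passed to HTK as HCopy -C config_WAV_MFCC_0 -S files.scp, in place of the explicit input and output names.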

Figure 1: Waveform and MFCC-based spectrogram for the utterance "one-three-four-five" by a female speaker.

3.2 Plotting MFCC-based spectrograms

Next we are going to use the features that you previously extracted using Praat. An example for one of my speech files begins like this:

File type = "ooTextFile"      % type of Praat text file
Object class = "MFCC 1"       % Praat object type
                              %
0                             % start time (s)
0.7153125                     % end time (s)
138                           % number of frames
0.005                         % frame step size (s)
0.015156250000000001          % analysis window size (s)
100                           % Mel filter - low frequency
2700                          % Mel filter - high frequency
12                            % number of coefficients per frame
12                            % number of values in next block of data
12.366817711479127            % c_0
51.160400586296326            % c_1
-69.9840039536427             % c_2
1.4429484536722643            % c_3
-38.498767684269815           % c_4
0.9143073673417336            % c_5
-4.481474945765951            % c_6
-14.13759795259745            % c_7
-10.311024674470207           % c_8
-4.471976728500627            % c_9
-0.6933540760388774           % c_10
5.1883829569384465            % c_11
-21.86388241674821            % c_12
12                            % number of values in next block of data

To read these numbers into Matlab, one simple method involves deleting the first few lines in a text editor. I suggest you save the truncated text file with a new file name. You can then load the column of numbers directly into a vector using load in Matlab. Then, you will need to discard any remaining parts of the Praat header and use reshape to re-format this vector into a matrix, as you had before.
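A minimal sketch of the reshape step is given here, assuming the loaded vector starts at the first frame's block count, so that each frame occupies 14 values (the count followed by c_0 to c_12). The data are synthetic stand-ins, not real Praat output, and the variable names are illustrative. The final line shows one possible way of reordering the rows into HTK's layout; this is an assumption of the sketch, and modifying mfccgram is the other route.

```matlab
% Synthetic stand-in for the column that load would return from the
% truncated Praat file (assumed to begin at the first per-frame count).
nFrames = 4; nCoef = 13;
C = reshape(1:nCoef*nFrames, nCoef, nFrames);  % fake c_0..c_12, one column per frame
v = [12*ones(1, nFrames); C];                  % prepend the per-frame count of 12
v = v(:);                                      % single column of numbers

% Each frame holds 1 + 13 values: the count, then c_0 to c_12.
P = reshape(v, nCoef + 1, []);
P = P(2:end, :);                               % discard the row of counts

% Reorder rows to HTK's MFCC_0 layout: c_1..c_12 followed by c_0.
Y = [P(2:end, :); P(1, :)];
```

With real data, the number of columns of Y should match the number of frames reported in the Praat header.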

Write a short Matlab script to read in Praat's MFCC features from a speech file and plot out the derived spectrogram using mfccgram. Again, I suggest you use the ones extracted from the word "three". This time you will need to modify the mfccgram function since Praat lists the MFCCs in order from 0 to 12, whereas HTK goes from 1 to 12 and then appends the zeroth coefficient.
Adapt your script to take just one frame from the middle of the word, and plot the corresponding log magnitude spectrum.
Having read in the speech waveform (e.g., using wavread), compute the log magnitude spectrum for the same part of the word and add this to your plot. (Useful functions include log10, abs, fft and hamming. You may also wish to use subplot to divide your figure window in Matlab.)
Finally for this section, redraw the spectrographic display for this word in a subplot, add vertical lines to indicate the word boundaries, based on your previous annotation, and add a subplot of the speech waveform with the same word boundary markers. You can use hold on to fix the contents of a figure so you can add extra lines in Matlab. Be careful to take into account the shortening effect of the analysis window in the spectrogram, and set the offset accordingly.
You need to submit this plot of the spectrogram derived from MFCCs for the word "three" including word boundaries and axis labels, saved in a compact portable image format (e.g., to JPEG, PNG or PDF using Matlab's print function). Note that Matlab has commands xlabel, ylabel, title and text to enable annotation of its graphs.
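The log magnitude spectrum of a single frame of the waveform can be sketched as follows. A synthetic 1 kHz tone stands in for the speech that you would read with wavread, and the 25 ms frame length and 512-point FFT are assumed values rather than settings taken from the lab's configuration file. The window is written out explicitly and gives the same values as hamming(N).

```matlab
% Sketch: log magnitude spectrum of one frame from the middle of a signal.
fs = 16000;                                  % sampling rate of the lab recordings
t  = (0:fs-1)'/fs;
x  = sin(2*pi*1000*t);                       % synthetic 1 kHz tone in place of speech
N  = 400;                                    % assumed frame length: 25 ms at 16 kHz
mid = round(length(x)/2);                    % centre of the signal (or of the word)
frame = x(mid-N/2+1 : mid+N/2);              % N samples around the centre
w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));    % Hamming window, as hamming(N)
nfft = 512;                                  % assumed FFT size (zero-padded)
S = fft(frame .* w, nfft);
logmag = 20*log10(abs(S(1:nfft/2+1)) + eps); % dB, from 0 Hz up to fs/2
f = (0:nfft/2)' * fs/nfft;                   % frequency axis in Hz
[pk, k] = max(logmag);                       % location of the spectral peak
```

Calling plot(f, logmag) followed by xlabel and ylabel then draws the spectrum, and hold on lets you overlay the MFCC-derived spectrum for the same frame.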

3.3 Calculating statistics

Re-run HCopy over each of your 11 digit recordings, adapting the configuration file provided to extract 39 MFCCs, including the log energy instead of c₀ together with the so-called deltas and accelerations (TARGETKIND = MFCC_E_D_A). Also change the shift between consecutive frames from 5 ms to 10 ms via the TARGETRATE parameter. Again, you can process the files either manually, one at a time, or using a script file that lists your WAV files.
As with the previous HTK files, read the feature vectors for the word "three" into Matlab (e.g., with readmfcc). Now use your word annotations to segment the MFCC matrix into three parts: initial silence, the word, and the silence at the end. This will give you three smaller matrices. Concatenate the two silence segments into one combined matrix representing a series of all the silence frames, and set it aside for the moment. Using the word matrix, calculate the mean (39×1) and covariance (39×39) of the features.
Repeat the process for all your files so that you have mean and covariance statistics for each word, store these results, and combine all of the silence segments into one big matrix.
Calculate the mean and covariance for the silence frames, in the same way, and store the results.
You need to submit the Excel spreadsheet containing the mean and covariance for your word "eight" on one sheet, and the same for silence on a second sheet (in 97/2000/XP format).
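The segmentation and statistics for one file might look like the sketch below. It uses random numbers in place of real features, and the word boundary times are hypothetical. The simple time-to-frame mapping ignores any offset due to the analysis window, which you may need to account for with your own data.

```matlab
% Sketch: split a feature matrix into word and silence, then compute statistics.
nCoef = 39; nFrames = 120;
shift = 0.010;                        % 10 ms frame shift, as set via TARGETRATE
Y = randn(nCoef, nFrames);            % synthetic stand-in for one file's features
tStart = 0.30; tEnd = 0.85;           % hypothetical word boundaries (s)

i1 = max(1, round(tStart/shift) + 1); % first frame of the word
i2 = min(nFrames, round(tEnd/shift)); % last frame of the word

W   = Y(:, i1:i2);                         % word frames
sil = [Y(:, 1:i1-1), Y(:, i2+1:end)];      % initial and final silence, concatenated

mu    = mean(W, 2);                   % 39x1 mean vector
Sigma = cov(W');                      % 39x39 covariance (cov takes one row per frame)
```

Appending each file's sil matrix to a running matrix, e.g. allSil = [allSil, sil], collects the silence frames across all files before their statistics are computed in the same way.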

3.4 Initialising HMM templates

The final task in this assignment is to use the statistics that you have just computed in Section 3.3 to initialise some HMM templates. You will be using these in the next lab session as the basis of your recognition system. These prototypes are to be generated in HTK format, for which an example is provided (see Section 3.2.1, p.31 in HTK Book: prototype HMM). Open this file in a text editor and first make 12 copies of the prototype within the same text file, modifying the name of each model from "proto" to the appropriate word or silence: "zero", "oh", "one", ..., "nine" and "sil".
For each of the digits, replace the values in <Mean> with the corresponding mean vector calculated previously, in the same format (i.e., as a row of floating point numbers separated by spaces). Repeat this for all of the emitting states (i.e., states 2 to 6 because HTK also counts the null states 1 and 7).
Similarly, for the <Variance>, you need to take the diagonal elements of the covariance, and convert them into a row vector in the same format (see diag). Copy this for all emitting states within each word model.
The transition probabilities specify a strict left-right topology, and are fine for our purposes. So, hopefully you do not need to make any further changes. However, it is worth making a check of your file using the functions readhmm and writehmm.
For the silence model, trim down the model so that it only has 3 emitting states (or a total of 5 states including the entry and exit null states). Use the mean and diagonal elements of the covariance to populate the <Mean> and <Variance> fields. Again, check that there are no errors reading in your file with readhmm.
You need to submit the text file containing your twelve HMM prototype definitions (eleven digits plus silence) in HTK format.
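To avoid copying 39 numbers by hand for every state, the rows could be generated with sprintf, as in this sketch. The zero mean and identity covariance are placeholders for your own statistics, and the exact layout (keywords, spacing, number format) is an assumption that should be checked against the prototype file provided.

```matlab
% Sketch: format a mean and diagonal variance as text for each emitting state.
nCoef = 39;
mu    = zeros(nCoef, 1);              % placeholder for a word's mean vector
Sigma = eye(nCoef);                   % placeholder for its covariance matrix
v = diag(Sigma);                      % variances: the diagonal of the covariance

txt = '';
for s = 2:6                           % emitting states of a 7-state digit model
    txt = [txt, sprintf('<State> %d\n', s)];
    txt = [txt, sprintf('<Mean> %d\n', nCoef), sprintf(' %e', mu), sprintf('\n')];
    txt = [txt, sprintf('<Variance> %d\n', nCoef), sprintf(' %e', v), sprintf('\n')];
end
```

Use disp(txt) to view the result, which can then be pasted over the corresponding lines of the prototype. For the silence model, the loop would run over its three emitting states instead.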

3.5 Assignment submission

Collate the files identified as deliverables in Section 2 in one directory.
Package and compress them into a zip file, e.g., called lab2_YourName.zip.
Submit the file as assignment 2 in SurreyLearn.

4. Further reading

If you would like to find out more on training HMMs, you should have a look at the following book:

Jelinek, F., Statistical Methods for Speech Recognition, MIT Press, 1998 [ISBN 0-262-10066-5].

5. Troubleshooting

Configuring any HTK tool: `HCopy`, `HList`, etc.

Consult the HTK Book (pdf on campus only).

Running `HCopy -C config_WAV_MFCC_0 input.wav output.mfcc` in Windows

If HTK has been correctly installed on your computer and you have added the program directory to your path, then you should be able to execute HCopy (or any other HTK tool) from any directory. You can check this by running HCopy with no arguments (or HCopy -V), for example.

To execute the command above, HCopy needs to find and read the configuration file and the input wave file. In my example, I just put the name input.wav, but you should change this to the name of one of your speech files. The output file name output.mfcc should be changed accordingly to something sensible.

To confirm that the configuration file and the input wave file are in your working directory, check the full file names, file sizes and permissions for silly mistakes. You can do this in a DOS command window with the dir instruction and by inspecting the file properties. A file manager also works, but make sure that it displays the file extensions so that you can see the full file names.

If you have checked all these things and are still having problems, then there may be some problem with conversion from Linux to DOS. You could open the configuration file in a text editor in Windows and check that it looks the same as it did in the web browser, re-save it and try again.

Finally, if all else fails and you are running out of time, I recommend that you use the installation on the Linux machines in our labs, where we know it works!
