EEM.ssr: Lab 1 - speech enrolment and analysis


Assessment: students should work individually and submit their files via SurreyLearn to Dr Jackson before 4pm on Tuesday 4 March 2014
Required software: Praat (v.5.1 or higher) is the recommended application for this laboratory
Additional software: Audacity (v.1.2.4) for sound recording/editing; SFS (v.4.7) for speech analysis, annotation and visualisation; HTK (v.3.4.1) for audio feature extraction, HMM training and results analysis

Aims of the Experiment

To gain experience of recording, analysing and manipulating speech signals. To use an existing software application to extract acoustic features of the speech signal. To familiarise oneself with the audio capturing equipment and editing software.

Recommended reading

J. Borwick, "Sound Recording Practice", 4th ed., Oxford University Press, Oxford, 1996.

1. Background

1.1 Acoustical considerations

Although it may appear to be a trivial task, there are many factors that must be considered in order to obtain high-quality recordings of speech, such as the acoustical environment, the particular characteristics and placement of the microphone, and the conversion of the analogue signal into a digital format. It is recommended that the reverberation time of a recording studio should not exceed 0.4 seconds. There are several criteria for selecting a microphone type. Directional microphones are often used, since the most sensitive region of the microphone's directivity pattern can be aimed at the desired source, making the microphone less sensitive to room reflections arriving from other angles of incidence. Directional microphones (pressure-gradient microphones in particular) exhibit a proximity effect, a boost in low-frequency response as the source moves closer; consequently their frequency response is generally not flat and depends on the distance between the speaker and the microphone.


Figure 1: Directivity patterns of two typical microphone types
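As a rough numerical illustration of the directivity patterns in figure 1 (not part of the lab procedure), the short Python sketch below compares an idealised omnidirectional pattern with a cardioid pattern, modelled as s(θ) = (1 + cos θ)/2, for sound arriving on-axis, from the side and from the rear; the choice of these two particular patterns is an assumption made for the example.

    # Illustrative sketch: sensitivity of omnidirectional vs. cardioid patterns
    # (cardioid modelled as 0.5 * (1 + cos(theta))) at a few angles of incidence.
    import numpy as np

    angles_deg = np.array([0, 90, 180])        # on-axis, side and rear incidence
    theta = np.deg2rad(angles_deg)

    omni = np.ones_like(theta)                 # equal sensitivity at every angle
    cardioid = 0.5 * (1.0 + np.cos(theta))     # most sensitive on-axis, null at the rear

    for a, o, c in zip(angles_deg, omni, cardioid):
        c_db = 20 * np.log10(max(c, 1e-6))     # convert sensitivity ratio to dB
        print(f"{a:3d} deg   omni: {20 * np.log10(o):6.1f} dB   cardioid: {c_db:6.1f} dB")

A cardioid microphone attenuates sound arriving from the side by about 6 dB and largely rejects sound from directly behind, which is why off-axis room reflections contribute less to the recording.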

1.2 Analogue-to-digital (A/D) conversion

There are two main parameters that characterize the analogue-to-digital conversion of an audio signal: the sample rate (the number of samples per second, in Hz) and the number of bits used to represent each sample (the bit resolution). Sampling the analogue signal at discrete time intervals is, in principle, a lossless process (i.e., it allows a perfect representation and reconstruction of the signal), provided that the highest frequency component in the signal is below the Nyquist frequency (half the sample rate).

In contrast, quantization is a lossy process: once the continuous amplitude of the analogue signal has been mapped to one of a finite set of levels and encoded as a sequence of bits, some detail is irretrievably lost, causing quantization error (fig. 2). The maximum amplitude of this error depends on the bit resolution: the larger the number of bits, the more accurately the signal can be encoded and, hence, the smaller the quantization error. The quantization error usually sounds like white noise superimposed on the signal; however, if the ratio of the signal amplitude to the quantizing step (the minimum difference between two quantization levels) is small, it is heard instead as interference that is correlated with the original signal. For the sinusoid in fig. 2, the quantization error is also periodic, i.e. harmonics of the sinusoid are added during the quantization process. To avoid this effect, which is especially annoying for low-level, low-frequency tones, a small amount of random noise, or dither, can be added to the signal prior to quantization. Although this adds a small continuous noise to the signal, it is perceptually preferable, because it randomizes (decorrelates) the quantization error.


Figure 2: A/D Conversion at 3-bit resolution
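To illustrate the behaviour described above (this sketch is not part of the lab, which uses Praat throughout), the following Python fragment quantizes a low-level sinusoid at 3-bit resolution, as in fig. 2, and compares the quantization error with and without dither; the signal frequency, level and dither amplitude are arbitrary values chosen for the example.

    # Illustrative sketch: 3-bit quantization of a low-level sinusoid,
    # with and without dither added before the quantizer.
    import numpy as np

    fs = 8000                                   # sample rate (Hz)
    t = np.arange(0, 0.05, 1.0 / fs)            # 50 ms of signal
    x = 0.2 * np.sin(2 * np.pi * 200 * t)       # low-level 200 Hz sinusoid

    bits = 3
    step = 2.0 / (2 ** bits)                    # quantizing step for a [-1, 1) range

    def quantize(sig):
        # round each sample to the nearest quantization level
        return np.round(sig / step) * step

    err_plain = quantize(x) - x                            # error correlated with x
    dither = (np.random.rand(len(x)) - 0.5) * step         # small uniform random noise
    err_dithered = quantize(x + dither) - x                # error decorrelated from x

    print("rms error without dither: %.4f" % np.sqrt(np.mean(err_plain ** 2)))
    print("rms error with dither:    %.4f" % np.sqrt(np.mean(err_dithered ** 2)))

The dithered version has slightly larger error power, but the error no longer follows the waveform of the sinusoid, which is what makes it perceptually less objectionable.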

To obtain high quality digitised speech, it is advisable to use the highest available sample rate and bit resolution (standard CD-quality audio is sampled at 44.1 kHz with 16-bit resolution, whereas a high-end analogue-to-digital converter (ADC) may offer a 96 kHz sample rate and 24-bit resolution). We will begin by recording at 48 kHz, a rate used in many professional audio systems. In practice, however, the choice is often restricted by the limited bandwidth available in a telecommunication channel or by the limited space available on a storage medium. Typically, in digital telephony, speech is first band-limited to between 200 and 3400 Hz and then sampled at 8 kHz. Good quality speech normally uses a sampling rate of 16 kHz or higher, for which wideband speech codecs exist. For a given application, a trade-off must therefore be sought between the overall bit rate (sample rate × bit resolution) and speech quality.
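To make the trade-off concrete, the sketch below (again not part of the lab procedure) tabulates the single-channel bit rate and the storage needed for one minute of audio for a few common sample-rate and bit-resolution combinations; the configurations listed are simply familiar examples, and in real telephone systems the 8-bit samples are logarithmically companded rather than linear.

    # Illustrative sketch: overall bit rate = sample rate x bit resolution (mono channel).
    configs = [
        ("Telephony (narrowband)",  8000,  8),
        ("Wideband speech",        16000, 16),
        ("CD audio",               44100, 16),
        ("High-end ADC",           96000, 24),
    ]

    for name, fs, bits in configs:
        kbps = fs * bits / 1000.0                   # bit rate in kbit/s
        mb_per_min = fs * bits * 60 / 8 / 1e6       # storage for one minute, in MB
        print(f"{name:25s} {kbps:8.1f} kbit/s   ({mb_per_min:5.2f} MB/min)")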

2. Overview

An overview of the tasks for this experiment is as follows:
  1. Record, analyse, annotate and extract acoustic features from two isolated words
  2. Record and annotate a set of isolated digits
A summary of the deliverables that are to be submitted is as follows:
  1. Two isolated words (yes/no):
    1. wave files (mono, 16 kHz, 16-bit) of "yes" and "no", with a short silence (i.e., 20-100 ms) at the start and end of each
    2. phone-level label files (Xwaves format) for each word, including silence
    3. word-level label files (Xwaves format) for each word, including silence
    4. formant frequency text file (generated manually), giving typical F1-F3 values in each of four phones [eh, s, n, ow]
    5. LFCC files (short text file format) for each word
    6. MFCC files (short text file format) for each word
  2. Set of isolated digits (oh/zero/1..9)
    1. wave files (mono, 16 kHz, 16-bit) of each spoken digit, with a short silence (i.e., 20-100 ms) at the start and end of each
    2. word-level label files (Xwaves format) for each word, including silence
    3. MFCC files (short text file format) for each word
The deadline for submission of your files via SurreyLearn to Dr Jackson is 4pm on Tuesday 4 March 2014.

3. Experimental Work

3.1 Setting the recording conditions for the microphone

  1. Open Praat (e.g., by typing praat & in a terminal window) and select [New| Record mono sound].
    If the microphone is not responding, check that the jack is plugged into the correct input on the front of the computer and that the microphone is switched on.
    If there is still no response, open the audio controls via [System| Preferences| Sound], or by clicking on the loudspeaker icon at the top right of the desktop screen [Volume control]. The Hardware should be set to Internal Audio with the Analogue Stereo Duplex profile. The Input volume needs to be about 80% and not muted, with connector Microphone 2. Hopefully, this should have fixed the problem [Close]!
    Record the word "test" at 3 different distances between the microphone and the speaker: (1) as close as possible to the speaker, (2) at a distance of about 20 cm, (3) at about 80 cm or a full arm's length. The parameters of the A/D conversion should be the same in all three cases (mono, mic input, 48 kHz), but the recording gain can be adjusted if necessary to maintain a reasonably loud recording level without any audible or visible distortion.
  2. Name the recorded sounds test1, test2 and test3. Listen to all 3 recordings and describe their sound by providing comments in Table 1.
Position of the microphone                           Near      Intermediate   Far
Distance between the microphone and the speaker      2 cm      20 cm          80 cm
Clarity of speech (e.g. high, medium, low), i.e. how easy the speech is to understand
Is speech timbre "coloured" by the room acoustics? (Yes/No)
Can you hear a proximity effect? (Yes/No)
Are any distortions audible (e.g. plosive sounds, background/electrical noise)? If so, what kind?
Which distance out of these three would you recommend?
Other comments

Table 1: Effects of the microphone position - Informal listening test report.
  3. Using the best configuration and trying to maintain a constant sound intensity in both cases, record two isolated words, "yes" (/j eh s/) and "no" (/n ow/). Rename the sounds accordingly, once you have checked that the recordings are acceptable.
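If you would like a numerical check on your recording levels, the following Python sketch (not part of the lab procedure) reports the peak and RMS level of each test recording and counts samples at full scale, which would indicate clipping; it assumes you have exported the sounds from Praat as 16-bit mono WAV files named test1.wav, test2.wav and test3.wav (hypothetical file names).

    # Illustrative sketch: peak/RMS level and clipping check for the test recordings.
    import numpy as np
    from scipy.io import wavfile

    for name in ["test1.wav", "test2.wav", "test3.wav"]:
        fs, data = wavfile.read(name)               # 16-bit mono WAV assumed
        x = data.astype(np.float64) / 32768.0       # scale samples to [-1, 1)
        peak = np.max(np.abs(x))
        rms = np.sqrt(np.mean(x ** 2))
        clipped = np.sum(np.abs(x) >= 32767.0 / 32768.0)   # samples at full scale
        print(f"{name}: peak {20 * np.log10(peak):6.1f} dBFS, "
              f"rms {20 * np.log10(rms):6.1f} dBFS, clipped samples: {clipped}")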

3.2 Editing "yes" and "no"

  1. Taking your best recordings of "yes" and "no", use Praat to view each of the two waveforms [Draw] and the corresponding wideband spectrograms [Spectrum| To spectrogram] followed by [Draw| Paint]. Can you see distinct characteristics of the two speech patterns?
  2. Repeat this operation but adjust the window length from the default (5 ms) to a longer duration of 25 ms to obtain the narrowband spectrogram. Do the patterns remain the same? What changes can you observe?
  3. For each file in turn, create annotations [Annotate| To textgrid] with tier names "word phone", leaving the point tiers empty. Select the sound and the text grid together, [Edit], and then enter the word and phone annotations, remembering to include silence (as sil) in both tiers. You should aim to achieve an accuracy within 10 ms. You will get the best results by a combination of listening to speech segments, and viewing the waveforms and spectrograms.
  4. Hence, create two new words, "yo" (/j ow/) and "ness" (/n eh s/), from your existing recordings by splicing the first part of one with the second part of the other, using your phone-level annotations to locate the cut points. Listen to the resultant sounds, and play the original and the edited recordings to the demonstrator. Can you hear any artefacts? How could you improve the quality of the synthesized words? (An illustrative splicing sketch is given after this list.)
  5. Now convert your original recordings ("yes" and "no") to a lower sampling rate of 16 kHz ready for the next stages of speech analysis [Synthesize| Convert| Resample].
  6. Using your two 16-kHz sounds, extract the formants [Formants & LPC| To formant (Burg)]. From these results, inspect the formant frequency values for the first three formants (F1, F2, F3) and compare these with the spectrograms. Write down typical values for each of the following four phones: [eh], [s], [n] and [ow].
  7. Extract LFCC features for your two recordings [Formants & LPC| To LPC (autocorrelation)] with order 16, and then [To LFCC] with 12 coefficients.
  8. Extract MFCC features for your two speech files [Formants & LPC| To MFCC] with 12 coefficients. You will use this type of feature in the recognizer you will develop in the following lab assignments.
  9. If you have not already done so, you should save a copy of your work at this stage. Export the 16-kHz sound files in WAV format [Write| Write to WAV file]. Extract the annotations [Extract tier] and then [Write| Write to Xwaves label file]. For the computed LFCCs and MFCCs, dump them to a compact text file [Write| Write to short text file].
  10. Note for Windows users: If the Xwaves label file option is unavailable, use [Save| Save as chronological text file].
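For reference only, the splicing operation in step 4 can also be sketched outside Praat. The Python fragment below is an illustration of the idea, not the required Praat workflow; it assumes the two recordings have been exported as yes.wav and no.wav, and the cut times shown are placeholders that you would replace with the phone boundaries from your own annotations.

    # Illustrative sketch: splice "yes" and "no" to form "yo" and "ness".
    import numpy as np
    from scipy.io import wavfile

    fs_yes, yes = wavfile.read("yes.wav")
    fs_no, no = wavfile.read("no.wav")
    assert fs_yes == fs_no, "both recordings must have the same sample rate"

    cut_yes = 0.25    # end of /j/ in "yes" (placeholder: use your own boundary time)
    cut_no = 0.20     # end of /n/ in "no"  (placeholder: use your own boundary time)

    i_yes = int(cut_yes * fs_yes)                     # boundary sample indices
    i_no = int(cut_no * fs_no)

    yo = np.concatenate([yes[:i_yes], no[i_no:]])     # /j/ followed by /ow/
    ness = np.concatenate([no[:i_no], yes[i_yes:]])   # /n/ followed by /eh s/

    wavfile.write("yo.wav", fs_yes, yo)
    wavfile.write("ness.wav", fs_yes, ness)

An abrupt cut at an arbitrary sample can itself produce an audible click; cutting at a zero crossing, or applying a very short cross-fade across the join, usually reduces such artefacts.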

3.3 Recordings of isolated words for digit recognition

  1. In subsequent labs, you will be developing a digit recognizer. So you will first need to make recordings of the digits. Using the best configuration and trying to maintain a constant sound intensity in all cases as before, record eleven isolated words at the lower 16-kHz sampling rate [New| Record mono sound]: "oh", "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", and "nine". Rename the sounds accordingly, once you have checked that the recordings are acceptable by listening and viewing the waveforms [Draw] and spectrograms [Spectrum| To spectrogram] followed by [Draw| Paint].
  2. For each of the eleven files in turn, create annotations [Annotate| To textgrid] with just one tier name "word", leaving the point tiers empty. Select the sound and the text grid together, [Edit], and then enter the word annotations, remembering to include silence (as sil). This process is called endpointing; a simple automatic alternative is sketched after this list.
  3. Extract MFCC features for your eleven speech files [Formants & LPC| To MFCC] with 12 coefficients, as before.
  4. Export the 16-kHz sound files in WAV format [Write| Write to WAV file]. Extract the annotations [Extract tier] and then [Write| Write to Xwaves label file]. For the MFCCs, dump them to a compact text file [Write| Write to short text file].
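In this lab the endpointing is done by hand in Praat, but the same idea can be automated. The sketch below is an illustration only: it estimates the start and end of speech from short-time frame energies, assuming a 16-kHz mono recording exported as one.wav (hypothetical name), with frame length, hop and threshold values chosen arbitrarily for the example.

    # Illustrative sketch: simple energy-based endpointing of an isolated word.
    import numpy as np
    from scipy.io import wavfile

    fs, data = wavfile.read("one.wav")              # 16-bit mono WAV assumed
    x = data.astype(np.float64) / 32768.0

    frame_len = int(0.025 * fs)                     # 25 ms analysis frames
    hop = int(0.010 * fs)                           # 10 ms hop between frames
    n_frames = 1 + (len(x) - frame_len) // hop

    energy = np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    threshold = 0.05 * np.max(energy)               # crude threshold relative to the peak

    speech = np.where(energy > threshold)[0]        # frames judged to contain speech
    start = speech[0] * hop / fs
    end = (speech[-1] * hop + frame_len) / fs
    print(f"estimated speech segment: {start:.3f} s to {end:.3f} s")

Simple energy thresholds like this work reasonably well for isolated words recorded in quiet conditions, but they are easily misled by breath noise or background sounds, which is why you should always check the endpoints by listening.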

3.4 Assignment submission

  1. Collate the files identified as deliverables in section 2 in one directory.
  2. Package and compress them into a zip file, e.g., called lab1_YourName.zip.
  3. Submit the file as assignment 1 in SurreyLearn.

Further reading

If you would like to find out more on PCM quantisation and dithering, you should have a look at the following academic article:

Lipshitz, S. P. and Vanderkooy, J. (2004). "Pulse code modulation - an overview", Journal of the Audio Engineering Society, 52(3): 200-215.


The University of Surrey

© 2010-14, written by Philip Jackson, who maintains it, last updated on 21 Feb. 2014.
Ideas were drawn from related lab experiments designed by Mark Every, Slawek Zielinski and Philip Jackson.