EE1.LabB: ME1 Speech Capturing and Editing
- 2-3 members per group
- The lab is split into two halves (3.1-3.2 and 3.3-3.4) that can be completed independently
- Audacity (v1.2.4 or later) is the recommended application for this laboratory
Aims of the Experiment
To explore the effect of microphone placement on the quality of recorded speech.
To evaluate the effects of sample rate and bit resolution on the quality of captured speech, and the effect of adding dither prior to quantization.
To familiarise oneself with the audio capturing equipment and editing software.
J. Borwick, "Sound Recording Practice", 4th ed., Oxford University Press, Oxford, 1996.
- Chapter 3: Digital theory. Topics: The digital signal, Sampling, Quantizing, Numbering.
- Chapter 6: Microphones. Topics: The acoustical/mechanical conversion, Effects of distance on frequency response.
- Chapter 15: The spoken word. Topic: A single voice.
1.1 Acoustical considerations
Although recording speech may appear to be a trivial task,
there are many factors that must be considered in order to obtain high
quality recordings of speech, such as the acoustical environment, the
particular characteristics and placement of the microphone, and the
conversion of the analogue signal into a digital format. It is
recommended that the reverberation time of a recording studio should
not exceed 0.4 seconds. There are several criteria for selecting a
microphone type. Directional microphones are often used since the most
sensitive region in the directivity pattern of the microphone can be
directed at the desired source, making it less sensitive to room
reflections arriving from other angles of incidence. However, it must
be borne in mind that directional microphones (pressure gradient
microphones in particular) exhibit a proximity effect, and consequently
their frequency response is generally not flat, and depends on the
distance between the speaker and the microphone. Occasionally, this
property is deliberately exploited as a form of spectral modification.
For example, by placing a microphone near the speaker it is possible to
boost low frequencies in speech and hence make the sound "warmer".
Figure 1: Directivity patterns of two typical microphone types
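The angular dependence described above can be illustrated numerically. The sketch below is not part of the original handout: it assumes the standard first-order polar equation r(θ) = a + (1 − a)·cos θ, where a = 1 gives an omnidirectional pattern and a = 0.5 a cardioid. The function names and the −60 dB floor are illustrative choices.

```python
import math

def sensitivity(theta_deg, a):
    """Relative sensitivity of a first-order microphone at angle theta.

    a = 1.0 -> omnidirectional, a = 0.5 -> cardioid, a = 0.0 -> figure-of-eight.
    (Assumed textbook model, not taken from the handout.)
    """
    return abs(a + (1.0 - a) * math.cos(math.radians(theta_deg)))

def sensitivity_db(theta_deg, a, floor_db=-60.0):
    """Sensitivity in dB relative to on-axis, clipped at floor_db for nulls."""
    s = sensitivity(theta_deg, a)
    if s <= 10 ** (floor_db / 20.0):
        return floor_db
    return 20.0 * math.log10(s)

for name, a in [("omnidirectional", 1.0), ("cardioid", 0.5)]:
    row = ", ".join(
        f"{ang:3d} deg: {sensitivity_db(ang, a):6.1f} dB" for ang in (0, 90, 180)
    )
    print(f"{name:16s} {row}")
```

Under this model a cardioid is about 6 dB less sensitive to sound arriving from the side than from the front, and has a null directly behind, which is why room reflections from those directions are attenuated.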
1.2 Analogue-to-digital (A/D) conversion
There are two main parameters that characterise the
process of analogue-to-digital conversion of an audio signal: the
sample rate (number of samples per second, in Hz) and the number of
bits used to represent each sample (the bit resolution). Sampling the
analogue signal at discrete time intervals is, in principle, a lossless
process (i.e., it allows for a perfect representation and reconstruction
of the signal), provided that the highest frequency component within
the signal is less than the Nyquist frequency (half the sample rate).
In contrast to sampling, quantization is a lossy process. In other
words, once the continuous value of the analogue signal is quantized as
a sequence of bits at some moment, some detail is lost. This detail is
referred to as the quantization error (fig. 2). The maximum amplitude
of this error depends on the bit resolution: the larger the number of bits,
the more accurately the signal can be encoded, and hence the smaller
the quantization error. The quantization error sounds like white noise
superimposed on the signal or, if the ratio of signal amplitude to quantizing level
(the minimum difference between two sampled values) is small, like
an interference that is correlated with the original signal. For the
sinusoid in fig. 2, the quantization error is also periodic, i.e. harmonics of the sinusoid are added during
the quantization process. To avoid this effect, which is especially annoying for low-level, low-frequency tones, a small amount of
random noise, or dither, can be added to the signal prior to
quantization. Although this adds a small continuous noise
to the signal, perceptually this is outweighed by the
randomization of the quantization error.
Figure 2: A/D conversion at 3-bit resolution
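The behaviour of the quantization error for a sinusoid can be reproduced with a short sketch. It assumes a uniform mid-tread quantizer with rounding to the nearest level. The amplitude (0.7 of full scale) and the period (32 samples) are illustrative values, not taken from the figure.

```python
import math

def quantize(x, bits):
    """Uniformly quantize x in [-1, 1) to a signed integer code of the given bit depth."""
    levels = 2 ** (bits - 1)             # e.g. 4 for 3 bits: codes -4 .. 3
    code = math.floor(x * levels + 0.5)  # round to the nearest quantizing level
    return max(-levels, min(levels - 1, code))

def dequantize(code, bits):
    """Map an integer code back to the range [-1, 1)."""
    return code / 2 ** (bits - 1)

bits = 3
period = 32  # samples per cycle of the test sinusoid (illustrative value)
signal = [0.7 * math.sin(2 * math.pi * n / period) for n in range(2 * period)]
codes = [quantize(x, bits) for x in signal]
error = [dequantize(c, bits) - x for c, x in zip(codes, signal)]

step = 1 / 2 ** (bits - 1)
print("quantizing step:", step)
print("largest error  :", max(abs(e) for e in error))
# The error repeats with the same period as the sinusoid,
# so it is correlated with the signal rather than noise-like.
print("error periodic :", all(abs(error[n] - error[n + period]) < 1e-9 for n in range(period)))
```

The error never exceeds half a quantizing step, and because it repeats every cycle it adds harmonics of the sinusoid instead of a broadband hiss.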
In order to obtain high quality digitised speech it
is advisable to use the highest possible sample rate and bit resolution
(standard CD quality audio is sampled at 44.1 kHz, 16 bit resolution,
whereas a high-end analogue-to-digital converter (ADC) may have a 96
kHz sample rate and 24-bit resolution). However, these are often
restricted by a limited bandwidth available in a telecommunication
channel or by limited space available on a storage medium. Typically,
in digital telephony, speech is first band-limited to between 200 and
3400 Hz and then sampled at 8 kHz, although wideband speech codecs
exist with 16 kHz sample rates. For a given application, a trade-off
between the overall bit rate (sample rate × bit resolution) and speech
quality must be sought.
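As a worked example of the bit rate relation, the figures quoted above for CD-quality audio and a high-end ADC give the following per-channel rates. The symbols R (bit rate), f_s (sample rate) and b (bit resolution) are introduced here for illustration only.

```latex
R = f_s \times b
\qquad\text{(bits per second, per channel)}

R_{\text{CD}} = 44\,100 \times 16 = 705\,600\ \text{bit/s} = 705.6\ \text{kbit/s}

R_{\text{high-end}} = 96\,000 \times 24 = 2\,304\,000\ \text{bit/s} = 2304\ \text{kbit/s}
```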
- Draw and label five typical microphone directivity patterns. State
which of these you would recommend for speech recording, and in what
context and acoustical environment these would be appropriate.
- What is the proximity effect?
- Suggest some factors to consider when deciding upon the microphone placement.
- For a typical PCM ADC, in what way is the sample rate related to the bandwidth of the recorded signal?
- What is the effect of bit resolution on the signal-to-noise ratio?
Write down an equation giving the relationship between the bit
resolution of an ADC and the signal-to-noise ratio in decibels.
- Illustrate by means of a graph what is meant by a triangular probability density function (TPDF) dither signal of maximum +/− 2 quantizing levels.
3. Experimental Work
NB: Unless directed not to do so, make sure that recordings are saved as mono '.wav' files, and no dither is added!
Check the settings in the "Preferences" before starting.
3.1 Choosing the microphone position
- Record the following quote at 3 different distances between the
microphone and the speaker:
(1) as close as possible to the speaker,
(2) at a distance of about 50 cm,
(3) as far as possible.
"The microphone type is not critical provided it has a
smooth frequency response. Avoid bright-sounding microphones as they
often have resonances in the upper-middle and top registers which
emphasize sibilance." (Borwick, 1996)
The parameters of the A/D conversion should be the same
in all 3 cases (16 bits, 44.1 kHz) but the recording gain in each
case should be adjusted in the mixing desk in order to maintain a
reasonably loud recording level without any audible or visible
distortion at the loudest speaking volume. Make sure you are aware of
the directivity pattern of the microphone being used.
- Listen to all 3 recordings and describe their sound by providing comments in Table 1.
Table 1: Effects of the microphone position - Informal listening test report.
|Position of the microphone|(1)|(2)|(3)|
|---|---|---|---|
|Distance between the microphone and the speaker| | | |
|Clarity of speech (e.g. high, low, medium). This attribute is related to how easy it is to understand.| | | |
|Is speech timbre "coloured" by the room acoustics? (Yes/No)| | | |
|Can you hear a proximity effect? (Yes/No)| | | |
|Are any distortions audible? (e.g. plosive sounds, background/electrical noise). If so, what kind?| | | |
|Which distance out of these three would you recommend?| | | |
- Optimise the distance between the microphone and the speaker.
This may involve a number of trials during which you will listen and
subjectively assess the quality of the captured speech. Suggest an
appropriate distance and comment on the criteria you used to make this decision.
- For a microphone-to-speaker distance of about 50 cm record the following two words or use a pre-recorded sample:
- Edit the recorded words in order to change them into two words sounding like:
Advice: Cut the waveform at zero-crossings, and to speed up editing learn to use the shortcuts/control keys of the software application.
- Demonstrate the original and the edited recordings to the demonstrator.
3.3 Effects of the sample rate on audio quality
- Use your recording from section 3.1 or another speech recording of your choice of high fidelity (e.g. http://sound.media.mit.edu/mpeg4/audio/sqam/) and sampled at 44.1 kHz, 16-bit resolution.
- Copy and convert the original recording into the following formats:
  - 16 bit PCM, 32 kHz,
  - 16 bit PCM, 16 kHz,
  - 16 bit PCM, 8 kHz.
These will be referred to as the down-sampled recordings.
- Listen to all recordings and comment on their quality in Table 2.
Table 2: Effects of the sample rate on audio quality - Informal listening test report.
| |16 bits, 44.1 kHz (original)|16 bits, 32 kHz|16 bits, 16 kHz|16 bits, 8 kHz|
|---|---|---|---|---|
|Brightness (e.g. very bright, bright, dull, very dull). This is a perceptual term used to describe the ratio of energy in high frequencies to low frequencies.| | | | |
|Clarity of speech (e.g. high, low, medium). This attribute is related to how easy it is to understand.| | | | |
|Hiss (e.g. imperceptible, perceptible but not annoying, very annoying). Hiss is a degradation similar to the sound "s" or white noise.| | | | |
|Overall Sound Quality. Use the following grading scale: 5 - Excellent, 4 - Good, 3 - Fair, 2 - Poor, 1 - Bad| | | | |
- For each of the 4 recordings calculate the mean values and standard
deviations of Overall Sound Quality scores given by yourself and at
least two other students undertaking this experiment. Present the
Overall Sound Quality results graphically as a function of sample rate
using an error bar plot (where the standard deviations are represented
by error bars).
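The statistics for the error bar plot can be computed as in the following minimal sketch. The scores shown are hypothetical placeholders, not measured results; replace them with the grades collected from your group.

```python
import statistics

# Hypothetical Overall Sound Quality scores (1-5) from three listeners;
# replace these with the grades actually collected in Table 2.
scores = {
    44100: [5, 5, 4],
    32000: [4, 5, 4],
    16000: [3, 4, 3],
    8000: [2, 2, 3],
}

for rate, grades in scores.items():
    mean = statistics.mean(grades)
    std = statistics.stdev(grades)  # sample standard deviation (n - 1 in the denominator)
    print(f"{rate / 1000:5.1f} kHz: mean = {mean:.2f}, std = {std:.2f}")
```

The resulting means and standard deviations can then be passed to any plotting tool that supports error bars, for example the `errorbar` function of Matplotlib with the standard deviations given as `yerr`. With only three listeners the sample standard deviation is a rough estimate, so the error bars should be interpreted cautiously.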
- Convert all the down-sampled recordings back to the original sample
rate of 44.1 kHz. Display the spectrograms of all four recordings. (The
spectrogram shows the distribution of energy contained within the
signal as a function of both time and frequency. The amount of signal
energy within a particular time-frequency "block" is proportional to
the average intensity of the spectrogram in this region). Explain why
the bandwidths of the audio signals differ for each recording.
Advice: A window length of around 2048 samples and hop size
of 1024 samples should be appropriate for computing the spectrogram at
a sample rate of 44.1 kHz.
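To see what the suggested window and hop sizes mean in time and frequency, the following sketch converts them to milliseconds and to the spacing between frequency bins. It assumes the FFT length equals the window length (no zero padding), which may differ from the settings in your software. It also lists the Nyquist frequency (half the sample rate, as defined in section 1.2) for each sample rate used in this section.

```python
def spectrogram_parameters(sample_rate, window_length, hop_size):
    """Return window duration (ms), hop duration (ms) and FFT bin spacing (Hz)."""
    window_ms = 1000.0 * window_length / sample_rate
    hop_ms = 1000.0 * hop_size / sample_rate
    bin_hz = sample_rate / window_length  # assumes FFT length == window length
    return window_ms, hop_ms, bin_hz

window_ms, hop_ms, bin_hz = spectrogram_parameters(44100, 2048, 1024)
print(f"window: {window_ms:.1f} ms, hop: {hop_ms:.1f} ms, bin spacing: {bin_hz:.1f} Hz")

# Nyquist frequency (half the sample rate) of each recording before
# conversion back to 44.1 kHz.
for rate in (44100, 32000, 16000, 8000):
    print(f"{rate / 1000:5.1f} kHz sample rate -> Nyquist frequency {rate / 2000:6.2f} kHz")
```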
- At what sample rate is there (in your opinion) a significant distortion of the speech signal?
3.4 Effects of bit resolution and dither on audio quality
- First check that the original recording is stored as a 16-bit signed integer in mono (rather than 32-bit floating point numbers in stereo, for example).
Copy it and convert it to:
- a WAV file in 8-bit PCM format, at 44.1 kHz
- a WAV file in 4-bit PCM format, at 44.1 kHz
Advice: If the software program does not have a function for
converting to 8- or 4-bit PCM, then attenuate the original 16-bit signal by a factor (e.g., for the 4-bit case by 2^(16−4) = 2^12, approximately 72 dB), save the result at 16-bit resolution, and afterwards amplify the attenuated signal by the same factor.
To do this in Audacity, it seems that you have to "Amplify" the
signal by −24 dB three times (24+24+24=72).
I recommend that you save the data at this point as a WAV file, open it as a new file in Audacity, then "Amplify" it again by +24 dB three times.
It should now be of comparable loudness to the original
recording, but effectively of lower bit resolution.
This trick can be
used to listen to the speech quality at any desired resolution.
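The attenuate, save and amplify trick can be sketched numerically as follows. This is a simplified model that assumes the 16-bit save rounds each sample to the nearest integer with no dither; the behaviour of a particular application may differ. The sample values are illustrative and not taken from a real recording.

```python
import math

def attenuation_db(original_bits, target_bits):
    """Attenuation in dB corresponding to a factor of 2**(original_bits - target_bits)."""
    return 20.0 * math.log10(2 ** (original_bits - target_bits))

def reduce_resolution(samples, target_bits, original_bits=16):
    """Emulate target_bits resolution on integer samples stored at original_bits.

    Attenuate, round to the nearest integer (the 16-bit save, assumed undithered),
    then amplify by the same factor.
    """
    factor = 2 ** (original_bits - target_bits)
    return [math.floor(s / factor + 0.5) * factor for s in samples]

print(f"16 -> 4 bits: {attenuation_db(16, 4):.2f} dB")
print(f"16 -> 8 bits: {attenuation_db(16, 8):.2f} dB")

# A few illustrative 16-bit sample values (not taken from a real recording).
samples = [0, 1000, -1000, 12345, -20000, 30000]
print(reduce_resolution(samples, 4))
```

For the 4-bit case the factor of 2^12 corresponds to about 72.25 dB, which is why three steps of 24 dB are a close approximation. Every output sample is a multiple of 4096, so only 16 distinct levels remain.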
- Listen to the converted recordings and comment on the nature of the
quantization error introduced by decreasing the bit resolution. The quantization error can be obtained by
converting the lower bit resolution recording back to 16-bit, inverting
it, and then adding it to the original recording.
Advice: You can listen to the combined output by pasting the original mono signal into the current window, selecting both signals (one inverted), and then playing.
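The invert-and-add procedure amounts to subtracting the low-resolution signal from the original. A minimal sketch, using illustrative sample values and assuming the 4-bit version was produced by undithered rounding:

```python
import math

original = [0, 1000, -1000, 12345, -20000, 30000]  # illustrative 16-bit samples
factor = 2 ** (16 - 4)                             # 4-bit emulation within a 16-bit file
reduced = [math.floor(s / factor + 0.5) * factor for s in original]

inverted = [-r for r in reduced]                     # "invert" the low-resolution track
error = [o + i for o, i in zip(original, inverted)]  # mix it with the original

print("quantization error:", error)
print("bound (half a quantizing level):", factor // 2)
```

Note that quiet samples (here ±1000) are rounded to zero, so for low-level passages the error is simply the signal itself, which is one way to hear how strongly the error is correlated with the speech.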
- Evaluate the sound quality of all three recordings in a similar manner to Table 2 in section 3.3.
- Can you explain why, from the point of view of A/D conversion, it is important to maintain a reasonably high recording level?
- Choose a relatively noiseless segment of a vowel, around 100 ms in length, from the original recording, and note the start
and end times of this segment. Observe the magnitude spectrum of this
segment and measure the ratio of the maximum spectral amplitude to the
spectral noise floor in dB. Perform the same operation on the 8-bit and
4-bit recordings taking care to use the same segment and tabulate your
results. How does the bit resolution affect this ratio?
- Copy the original recording and attenuate it by 72 dB, and then save it as separate 16-bit PCM, 44.1 kHz files under
the following conditions (dither settings in the Audacity preferences will have to be changed):
- no dither added
- rectangular probability density function (RPDF) dither added prior to quantization
- triangular probability density function (TPDF) dither added prior to quantization
- Amplify the above files by 72 dB (the amplification by +/− 72 dB has been used to exaggerate the effects of adding dither so that it is easily audible). Have a close look at the waveform and illustrate how the nature of the quantization differs between the three cases (TPDF dither is generally accepted as the most appropriate form of dither). Note down your impression of the perceptual characteristics of the three signals.
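The effect of the three dither conditions can be illustrated with a constant input that lies between two quantizing levels. This is a sketch under stated assumptions: the RPDF dither is taken to span ±0.5 quantizing level and the TPDF dither ±1 quantizing level (the sum of two rectangular sources). The amplitudes used by a particular application may differ. All function names are illustrative.

```python
import math
import random

def quantize(x):
    """Round to the nearest integer quantizing level."""
    return math.floor(x + 0.5)

def rpdf(rng):
    """Rectangular dither, uniform over +/- 0.5 quantizing level (assumed amplitude)."""
    return rng.random() - 0.5

def tpdf(rng):
    """Triangular dither over +/- 1 quantizing level (sum of two rectangular sources)."""
    return (rng.random() - 0.5) + (rng.random() - 0.5)

def mean_output(x, dither, trials=20000, seed=0):
    """Average quantizer output for a constant input x, in quantizing levels."""
    rng = random.Random(seed)
    return sum(quantize(x + dither(rng)) for _ in range(trials)) / trials

x = 0.3  # a constant input lying between two quantizing levels
print("no dither:", quantize(x))
print("RPDF     :", round(mean_output(x, rpdf), 3))
print("TPDF     :", round(mean_output(x, tpdf), 3))
```

Without dither the input is always rounded to the same level, so detail below one quantizing level is lost. With either form of dither the output fluctuates between neighbouring levels, but its average approaches the true input value, at the cost of added noise.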
- In telecommunications applications we often need to reduce the
overall bandwidth of the signal due to restricted transmission
conditions. A crude way of undertaking this task is to reduce the
sample rate or bit resolution. However, as you will have noticed, this
has an adverse effect on sound quality. Based on your measurements of
Overall Sound Quality in sections 3.3 and 3.4, given an available
bandwidth of 64 kbits/s, what sample rate and bit resolution would you
recommend for PCM coding of speech?
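As a starting point for this question, the following sketch lists which combinations of the sample rates from section 3.3 and the bit resolutions from section 3.4 fit within 64 kbit/s. It assumes a single (mono) channel and uncompressed PCM; the choice between the candidates still depends on your listening test results.

```python
BUDGET_BITS_PER_SECOND = 64_000

sample_rates = (44_100, 32_000, 16_000, 8_000)  # rates used in section 3.3
bit_resolutions = (16, 8, 4)                    # resolutions used in section 3.4

# Keep every (sample rate, bit resolution) pair whose mono PCM bit rate fits the budget.
candidates = [
    (rate, bits, rate * bits)
    for rate in sample_rates
    for bits in bit_resolutions
    if rate * bits <= BUDGET_BITS_PER_SECOND
]

for rate, bits, bit_rate in candidates:
    print(f"{rate / 1000:4.1f} kHz x {bits:2d} bits = {bit_rate / 1000:5.1f} kbit/s")
```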
If you would like to find out more on PCM and dithering, you should have a look at the following academic article:
Lipshitz, S.P., and Vanderkooy, J., (2004). "Pulse code modulation - an overview", Journal of the Audio Engineering Society, 52(3): 200-215.