Week 1
Speech recognition:
 Give a definition of automatic speech recognition that distinguishes it from other speech technologies.
ASR converts spoken words to text, whereas spoken language understanding includes machine reading comprehension, for example.
 How can speech technologies be said to increase access of
people with disabilities to computer-based systems?
As a method for converting natural language from one modality (auditory) into another (visual), ASR can form part of a more flexible human-computer interface, especially for those with limited auditory function.
Equally, speech synthesis (i.e., TTS) can help the blind or visually impaired access information stored
on computer or the internet by vocalising text.
Speech recognition can be used to assist those unable to type, e.g., through disability or injury, with text input.
Speech synthesis can provide a voice where the user cannot speak.
Sometimes providing an interface that combines modalities can be of
benefit to people with learning difficulties or special needs of various descriptions.
Speech communication:
 What acronym denotes the convention for transcribing the sounds of
the world's languages?
IPA, International Phonetic Alphabet.
 What is the difference between phones and phonemes?
A phoneme is an abstract sound category within a language, whereas a
phone denotes a practical realisation by a given speaker in a particular
context.
 What three attributes of natural speech contribute most to the
non-linguistic aspects known as prosody?
(i) intensity or amplitude,
(ii) timing or duration, and
(iii) pitch or intonation.
Phonetics:
 How can a basic understanding of phonetics facilitate
the study of speech signals?
It informs us of the sounds of a language and their categorisation.
 What is a diphthong? Illustrate your answer with an example.
A transition from one vowel sound to another, e.g., as in "by",
"boy" and "bough".
 What class of sounds includes /m/, /n/ and /ng/?
Nasals are all produced with an occlusion of the vocal tract in the oral cavity.
 What characteristics of the acoustic signal are most useful for
discriminating vowels?
The first two formant frequencies, F1 and F2.
 Give three environmental factors that can affect the way
speech is produced.
Noise (Lombard effect), vibration, stress/fear, cognitive loading, fatigue,
gas properties (air pressure/helium), etc.
 What are the three places of articulation for English plosive
consonants (aka stops)?
Labial (at the lips), alveolar (at the alveolar ridge toward the front of the mouth),
and velar (on the soft palate further back in the mouth).
 What is the main difference between the way that the sounds
/t/ and /s/ are produced?
/t/ is a plosive consonant which has a transient burst when air is suddenly
released;
/s/ is a fricative consonant which is produced by turbulent flow and can be
sustained.
However, they are both produced at approximately the same place of articulation.
 What name is given to the effect in fluent speech where, for
example, the phrase "isn't it" is pronounced as if it were "in'it"?
Elision, just like "bread and butter" becoming "brembudder".
Week 2
Dynamic Time Warping:
 Write a pseudocode description of the DTW algorithm using the
transitions shown in Fig.1 (left).
Apply a distortion penalty for the horizontal (H) and steepest (S) transitions,
d_{H} = d_{S} = d_{μ}/4,
where d_{μ} denotes the mean distance found across the
training data.
1. Initially:
   D(1,i) = d(1,i)    for i=1
   D(1,i) = 0         otherwise (i=2,...,N)

2. For t=2,...,T:
   D(t,i) = d(t,i) + D(t-1,i)+d_{H}                                        for i=1
   D(t,i) = d(t,i) + min[ D(t-1,i)+d_{H}, D(t-1,i-1) ]                      for i=2
   D(t,i) = d(t,i) + min[ D(t-1,i)+d_{H}, D(t-1,i-1), D(t-1,i-2)+d_{S} ]    otherwise (i=3,...,N)

3. Finally:
   Δ = D(T,N)
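
A minimal Python sketch of the recursion above, for illustration only. The function name dtw_cost and the toy distance grid are hypothetical. One assumption differs from the pseudocode as written: the unreachable cells at t=1 are set to infinity instead of 0, so that every path must start at (1,1).

```python
import math

def dtw_cost(dist, d_h, d_s):
    """Return the accumulated distance D(T,N) for a grid of local distances.

    dist[t][i] holds d(t+1, i+1): indices are 0-based here, 1-based in the
    pseudocode.
    """
    n_frames, n_template = len(dist), len(dist[0])
    # Assumption: only cell (1,1) is a valid start, so the remaining cells
    # at t=1 are set to infinity rather than 0.
    prev = [dist[0][0]] + [math.inf] * (n_template - 1)
    for t in range(1, n_frames):
        cur = []
        for i in range(n_template):
            best = prev[i] + d_h                       # horizontal, penalised by d_H
            if i >= 1:
                best = min(best, prev[i - 1])          # diagonal, no penalty
            if i >= 2:
                best = min(best, prev[i - 2] + d_s)    # steepest, penalised by d_S
            cur.append(dist[t][i] + best)
        prev = cur
    return prev[-1]

# Toy grid whose cheapest path is the pure diagonal (1 + 1 + 1).
local = [[1, 5, 9],
         [5, 1, 5],
         [9, 5, 1]]
print(dtw_cost(local, d_h=0.5, d_s=0.5))  # 3
```

Only the previous column of D is kept, which is enough when the total distance is wanted and no traceback is needed.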


 Modify your pseudocode to disallow two consecutive horizontal
transitions, as shown in Fig.1 (right).
1. Initially:
   D(1,i) = d(1,i)    for i=1
   D(1,i) = 0         otherwise (i=2,...,N)
   isFlat(i) = FALSE  for i=1,...,N

2a. For t=2:
   D(t,i) = d(t,i) + D(t-1,i)+d_{H}                                        for i=1
   D(t,i) = d(t,i) + min[ D(t-1,i)+d_{H}, D(t-1,i-1) ]                      for i=2
   D(t,i) = d(t,i) + min[ D(t-1,i)+d_{H}, D(t-1,i-1), D(t-1,i-2)+d_{S} ]    otherwise (i=3,...,N)
   isFlat(i) = TRUE   if arg min == D(t-1,i)+d_{H}
   isFlat(i) = FALSE  otherwise
   for i=1,...,N

2b. For t=3,...,T:
   if isFlat(i) == TRUE
      D(t,i) = INFINITY                                       for i=1
      D(t,i) = d(t,i) + D(t-1,i-1)                            for i=2
      D(t,i) = d(t,i) + min[ D(t-1,i-1), D(t-1,i-2)+d_{S} ]   otherwise (i=3,...,N)
      isFlat(i) = FALSE
      for i=1,...,N
   else
      D(t,i) = d(t,i) + D(t-1,i)+d_{H}                                        for i=1
      D(t,i) = d(t,i) + min[ D(t-1,i)+d_{H}, D(t-1,i-1) ]                      for i=2
      D(t,i) = d(t,i) + min[ D(t-1,i)+d_{H}, D(t-1,i-1), D(t-1,i-2)+d_{S} ]    otherwise (i=3,...,N)
      isFlat(i) = TRUE   if arg min == D(t-1,i)+d_{H}
      isFlat(i) = FALSE  otherwise
      for i=1,...,N
   end

3. Finally:
   Δ = D(T,N)
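
The isFlat bookkeeping can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function name is hypothetical, unreachable cells at t=1 are assumed to be infinite instead of 0, and ties are assumed to resolve in favour of the non-horizontal transition (the pseudocode does not say how ties are broken).

```python
import math

def dtw_cost_no_double_flat(dist, d_h, d_s):
    """Accumulated distance D(T,N) with two consecutive horizontals disallowed."""
    n_frames, n_template = len(dist), len(dist[0])
    # Assumption: only cell (1,1) is a valid start.
    prev = [dist[0][0]] + [math.inf] * (n_template - 1)
    is_flat = [False] * n_template   # was cell (t-1, i) reached horizontally?
    for t in range(1, n_frames):
        cur, cur_flat = [], []
        for i in range(n_template):
            # Each candidate is (accumulated cost, arrived_horizontally).
            candidates = []
            if i >= 1:
                candidates.append((prev[i - 1], False))         # diagonal
            if i >= 2:
                candidates.append((prev[i - 2] + d_s, False))   # steepest
            if not is_flat[i]:
                candidates.append((prev[i] + d_h, True))        # horizontal
            if candidates:
                # min() keeps the first minimum, so ties go to the
                # non-horizontal moves listed first (an assumption).
                best, flat = min(candidates, key=lambda c: c[0])
            else:
                best, flat = math.inf, False   # i=1 after a horizontal move
            cur.append(dist[t][i] + best)
            cur_flat.append(flat)
        prev, is_flat = cur, cur_flat
    return prev[-1]

# Five frames against a two-frame template need three horizontal moves and
# one diagonal, which cannot be arranged without two horizontals in a row.
zeros = [[0, 0]] * 5
print(dtw_cost_no_double_flat(zeros, d_h=1, d_s=1))  # inf
```

Because a single flag is stored per cell, the search only remembers how the best-scoring predecessor arrived, which is the same simplification the pseudocode makes.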


 How can silence and wildcard templates be used during
enrollment to help reduce endpoint detection errors?
These templates are introduced in (Holmes & Holmes, §8.10, p.123), and
their use in training is discussed in §8.13 on pp.125-6.
In essence the silence template allows for gaps and pauses in a recording, and
the wildcard can be used to represent parts of a training utterance that are
being captured.
Speech production:
 What human organ produces the quasi-periodic source of voiced sounds, such
as vowels?
Vocal folds (aka vocal cords), which are in the larynx.
 What is fundamental frequency (also called
f_{0}), and how is it produced?
It is the frequency that separates the pitch harmonics in the spectrum and is
usually the frequency of the lowest harmonic peak.
The periodic vibration is the result of oscillation of the vocal folds,
in the larynx.
 The vocal tract comprises three main passages. The pharynx and
oral cavity are two. What is the third?
Nasal cavity.
 The velum, larynx and jaw cooperate in the production of speech.
Name two other articulators.
Lips and tongue (body or tip).
 What is a formant and how is it produced?
A frequency region of high amplitude in a speech spectrum or spectrogram which
is produced as the result of an acoustic resonance in the vocal tract.
Speech analysis:
 What is the name of the organ in the inner ear that is responsible for
converting physical vibrations into a set of nerve responses
(i.e., electrical signals)?
Cochlea.
 What is the bandwidth to which the human ear responds (to one significant
figure), and what are the implications?
20 kHz, which implies a sampling rate above 40 kHz for hi-fi quality.
 If I calculate a DFT directly from a 40 ms section of a speech signal, what
will be the spacing of frequency bins in the spectrum?
25 Hz.
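The bin spacing is the sampling rate divided by the number of samples in the window, which reduces to the reciprocal of the window duration. A small check, assuming a few common sampling rates (the rates are illustrative and not part of the question):

```python
def dft_bin_spacing(sample_rate_hz, window_s):
    """Frequency spacing between adjacent DFT bins, in Hz."""
    n_samples = round(sample_rate_hz * window_s)   # DFT length
    return sample_rate_hz / n_samples

# The spacing depends only on the window duration, not on the sampling rate.
for fs in (8000, 16000, 48000):
    print(fs, dft_bin_spacing(fs, 0.040))
```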
 Boxcar (rectangular), Kaiser and Blackman define certain kinds of window. Name
three other popular window functions.
Hann, Hamming, Bartlett (triangular), Tukey, Gaussian, etc.
 What would be an appropriate window size for a narrowband
spectrogram?
Choosing 40 Hz between adjacent bins implies a minimum 25 ms window;
typically it is at least 30 ms.
 Give an estimate of the SNR for a full-scale 16-bit speech signal
in relation to the quantisation noise.
Signal uses 2^{16} peak-to-peak, the noise is 2^{0}=1 (+/- half
a bit), which gives a range of 96 dB (=16×6 dB).
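The "6 dB per bit" rule follows from 20 log10(2) ≈ 6.02 dB. A short check of the estimate (the function name is hypothetical):

```python
import math

def quantisation_range_db(bits):
    """Ratio of full-scale peak-to-peak range to one quantisation step, in dB."""
    return 20 * math.log10(2 ** bits)

print(round(quantisation_range_db(1), 2))    # 6.02
print(round(quantisation_range_db(16), 2))   # 96.33
```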
Week 3
Markov models:

Given an initial state vector π = [0.25 0.50 0.25], and
state-transition probability matrix
A = [0.25 0.50 0.25; 0.00 0.25 0.50; 0.00 0.00 0.25]:
 draw the model to show the states and permissible transitions;
See figure S1 opposite.
 calculate the probability of the state sequence X = {1, 2, 2, 3}
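The product of the initial-state probability and the subsequent transition probabilities is:
P(X={1,2,2,3}|M) = π_{1} a_{12} a_{22} a_{23} (1-a_{33}) = ¼ × ½ × ¼ × ½
× ¾ = 3/256, taking a_{23} = ½ from the matrix A above; this includes the transition to the exit node.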
 Using the Markov model presented in the lectures,
if today has been rainy, what is the most likely weather
 tomorrow,
It's most likely to rain, since P(x_{2}=1|x_{1}=1) = 0.4;
meanwhile P(x_{2}=2|x_{1}=1) = P(x_{2}=3|x_{1}=1) = 0.3.
 in two days' time (i.e., on Wednesday)?
It turns out most likely to be sunny:
P(x_{3}=3|x_{1}=1) =
(0.4×0.3)+(0.3×0.2)+(0.3×0.8) = 0.42;
whereas P(x_{3}=1|x_{1}=1) = 0.25 and
P(x_{3}=2|x_{1}=1) = 0.33.
 Calculate the probabilities of rain-rain-sun, rain-cloud-sun and rain-sun-sun,
assuming π_{rain} = 1.
P(X={1,1,3}|x_{1}=1,M) = 0.12,
P(X={1,2,3}|x_{1}=1,M) = 0.06,
P(X={1,3,3}|x_{1}=1,M) = 0.24.
 Hence, if we are told that it will be sunny on Wednesday for certain and that
it's rainy today, what's the most likely weather tomorrow?
Well, it looks like it's going to be sunny tomorrow too!
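The weather answers above can be checked numerically. The lecture's transition matrix is not reproduced in this sheet, so the matrix below is an assumption: it is the familiar three-state weather example, and it reproduces every probability quoted above. States are numbered 1 = rain, 2 = cloud, 3 = sun (0-based in the code).

```python
RAIN, CLOUD, SUN = 0, 1, 2

# Assumed transition matrix (rows: today, columns: tomorrow).
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def step(p, trans):
    """Propagate a state distribution forward by one day."""
    n = len(trans)
    return [sum(p[i] * trans[i][j] for i in range(n)) for j in range(n)]

def sequence_probability(states, trans, p_first=1.0):
    """Probability of a state sequence, given the probability of its first state."""
    prob = p_first
    for a, b in zip(states, states[1:]):
        prob *= trans[a][b]
    return prob

tomorrow = step([1.0, 0.0, 0.0], A)      # [0.4, 0.3, 0.3]
wednesday = step(tomorrow, A)            # about [0.25, 0.33, 0.42]

# Joint probabilities of each middle state with sun on Wednesday.
joint = [sequence_probability([RAIN, mid, SUN], A) for mid in (RAIN, CLOUD, SUN)]
# Conditioning on sun on Wednesday: normalise by their sum.
posterior = [j / sum(joint) for j in joint]
print([round(x, 3) for x in wednesday])
print([round(x, 3) for x in posterior])
```

The last line gives the distribution over tomorrow's weather once Wednesday is known to be sunny; its largest entry is the sunny state, in line with the answer above.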

Fig.S1. State transitions.

Hidden Markov models:
 Considering the state-transition topologies shown in Figure 2:
 write an expression for the state duration probability
P(τ|λ) in Fig.2(a);
P(τ|λ) = (a_{ii})^{τ-1} (1-a_{ii})
= 0.9^{τ-1}×0.1
 write an expression for the duration probability for each state
P(τ|x=i,λ)
in Fig.2(b);
P(τ_{1}|λ) = 0.8^{τ_{1}-1}×0.2
P(τ_{2}|λ) = 0.6^{τ_{2}-1}×0.4
 hence derive the distribution of duration probabilities for the entire
model in Fig.2(b);
P(τ=τ_{1}+τ_{2}|λ) =
P(τ_{1}|λ) * P(τ_{2}|λ), where * denotes
convolution.
Hence, P(τ|λ) = 0 for τ=1, and for τ≥2
P(τ|λ) = 0.2×0.4×Σ_{ν=0..τ-2}
0.8^{τ-2-ν} 0.6^{ν}
=> P(τ|λ) = {0, 0.08, 0.112, 0.1184, 0.112, 0.1000, ...} for τ = 1, 2, 3, ...
 how many terms are there in this expression for the model duration
with τ=5?
There are four terms, which correspond to the state sequences
{1,1,1,1,2}, {1,1,1,2,2}, {1,1,2,2,2} and {1,2,2,2,2}.
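The convolution of the two state-duration distributions for Fig.2(b) can be checked with a few lines of Python. This is a sketch: the function names are hypothetical, and the self-transition probabilities 0.8 and 0.6 are taken from the expressions above.

```python
def geometric_duration(a_self, max_tau):
    """P(tau) for one state, indexed by tau = 0..max_tau (P(0) = 0)."""
    return [0.0] + [a_self ** (tau - 1) * (1 - a_self)
                    for tau in range(1, max_tau + 1)]

def convolve(p, q):
    """Distribution of the sum of two independent durations (same truncation)."""
    return [sum(p[k] * q[tau - k] for k in range(tau + 1))
            for tau in range(len(p))]

max_tau = 5
p1 = geometric_duration(0.8, max_tau)   # state 1
p2 = geometric_duration(0.6, max_tau)   # state 2
model = convolve(p1, p2)

# Durations tau = 1..5 for the whole model.
print([round(x, 4) for x in model[1:]])  # [0.0, 0.08, 0.112, 0.1184, 0.112]
```

The sum for τ=5 has four non-zero products, one for each way of splitting five frames between the two states.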

Fig.2. HMM state transitions: (a) and (b).

Feature extraction 1:
 How can a bank of bandpass filters (each tuned to a different centre
frequency) be used to extract a feature vector that describes the overall
spectral shape of an acoustic signal at any particular time?
A vector can be made up by taking the energy in each band, averaged over
a short period (e.g., 20 ms).
 The real cepstrum of a digital signal sampled at 48 kHz is defined
as c_{s}(m) = IDFT( ln|S(k)| ), where S(k) is the signal's discrete
Fourier spectrum, ln|·| denotes the natural logarithm of the magnitude and IDFT is the inverse
discrete Fourier transform.
Considering only real, symmetric elements in the log-magnitude spectrum (i.e.,
the cos terms), draw the shapes of the first four cepstral coefficients
c_{0}, c_{1}, c_{2} and c_{3}, in the
log-magnitude spectral domain.
These are respectively a constant and three cosine functions of the form
cos x, cos 2x and cos 3x.
 What properties of the Mel-frequency cepstrum make it more like human
auditory processing, compared to the real cepstrum?
The warping of the frequency axis, based on human pitch perception.
 In calculating MFCCs, what is the purpose of:
 the log operation;
compression, source-filter decomposition
 mel-frequency binning;
perceptual weighting of information
 Discrete Cosine Transform?
decorrelates the coefficients, approximately diagonalising their covariance
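As an illustration of the final step, the sketch below applies an unnormalised DCT-II to a vector of log filterbank energies. The function name and the input values are hypothetical. Each output coefficient weights the log spectrum by a cosine of increasing frequency, which matches the constant and cos x, cos 2x, cos 3x shapes described for the real cepstrum above.

```python
import math

def dct_ii(log_energies):
    """Unnormalised DCT-II: c[n] = sum_k E[k] * cos(pi * n * (k + 0.5) / K)."""
    K = len(log_energies)
    return [sum(log_energies[k] * math.cos(math.pi * n * (k + 0.5) / K)
                for k in range(K))
            for n in range(K)]

# A flat log spectrum puts everything into c[0]; the higher coefficients,
# which describe spectral shape, are zero up to rounding error.
flat = [2.0] * 8
coeffs = dct_ii(flat)
print(round(coeffs[0], 6), max(abs(c) for c in coeffs[1:]) < 1e-9)
```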
Week 4
Hidden Markov models:
 Draw a trellis diagram for the HMM in Fig.2(b) and a 5-frame observation
sequence.
 Show all the paths that arrive in the final null node after the fifth
frame.
See Figure S2.
 How many different paths are there for this model and number of
observations?
Four, just as there were four terms in the answer to Q.5d above (week 3).
 Imagine you are designing an optical character recognition (OCR) system
for converting images of written words into text, based on HMMs.
The observations you're given come in the form of pixellated greyscale bitmaps
of a single line of writing.
Explain in general terms how you would construct the following components of the
system:
 frames of feature vectors to make up the observation sequences;
Either parameterise each column of pixels as a feature vector, or register the
locations of the individual characters and produce a vector from the extracted
bitmap.
 the models (each one comprising a state or set of states);
Either model each character as a sequence of pixel columns, or as one character
observation.
Could include some context-sensitive models for handwriting strokes, or for
special printed characters, e.g., "fi" and "fl" compared to
a standard "f".
 suitable annotations to be used during training;
A transcript of the text, including blanks and blotches where encountered.
 any special models (e.g., for dealing with blotches or blank spaces
within the pictures).
A blank space model and a generic blotch model could be supplemented by models
for any unusual characters (e.g., footnote, trademark and copyright symbols,
mathematical symbols or phonetic alphabet).
A model for lines could be considered, but it might be better to try to
eliminate them in the preprocessing stage.
 How do the components of the OCR system compare to those for an ASR system
designed to perform an Isolated Word Recognition task?
Pretty well!
The same components are required for an IWR system:
(a) sequences of acoustic feature vectors,
(b) models with multiple states to represent the sequence of sounds produced in
each vocabulary word,
(c) transcriptions of the utterances, and
(d) silence and wildcard models.

Fig.2. HMM state transitions: (a) and (b).
Fig.S2. Trellis diagram for τ=5.

Week 5
HMM decoding:
 In the Viterbi algorithm, what is the purpose of the variable
ψ_{t}(i)?
Indicator of the best predecessor, which can then be used during traceback to
identify the best path.
 What is the meaning of Δ^{*}?
The total likelihood of the best path, i.e., the joint probability of the observations
and the optimal state sequence given a particular model,
P(O,X^{*}|λ).
 Using the Viterbi algorithm, calculate the path likelihoods
δ_{t}(i), the value of
Δ^{*}, and use the values of
ψ_{t}(i) to extract the best path X^{*}.
Model parameters:
π=[1 0], A=[0.8 0.2; 0.0 0.6],
η=[0 0.4]^{T}, and B=[0.5 0.2 0.3; 0.0 0.9 0.1] where the
columns of the B matrix correspond to green (G), blue (B) and red (R) events,
respectively.
 for observations O¹={G,B,B} (worked example from lecture)
Step 1.
δ_{1}(1) = 1×0.5 = 0.5,   ψ_{1}(1) = 0
δ_{1}(2) = 0,   ψ_{1}(2) = 0
Step 2.
δ_{2}(1) = 0.5×0.8×0.2 = 0.08,   ψ_{2}(1) = 1
δ_{2}(2) = 0.5×0.2×0.9 = 0.09,   ψ_{2}(2) = 1
δ_{3}(1) = 0.08×0.8×0.2 = 0.0128,   ψ_{3}(1) = 1
δ_{3}(2) = max[0.08×0.2, 0.09×0.6] × 0.9 = 0.054×0.9 = 0.0486,   ψ_{3}(2) = 2
Step 3.
Δ^{*} = max[0.0128×0, 0.0486×0.4] = 0.01944,   x_{3}^{*} = 2
Step 4.
x_{2}^{*} = ψ_{3}(x_{3}^{*}) = ψ_{3}(2) = 2
x_{1}^{*} = ψ_{2}(x_{2}^{*}) = ψ_{2}(2) = 1
X^{*}¹={1,2,2}

 for observations O²={R,B}
Step 1.
δ_{1}(1) = 1×0.3 = 0.3,   ψ_{1}(1) = 0
δ_{1}(2) = 0,   ψ_{1}(2) = 0
Step 2.
δ_{2}(1) = 0.3×0.8×0.2 = 0.048,   ψ_{2}(1) = 1
δ_{2}(2) = 0.3×0.2×0.9 = 0.054,   ψ_{2}(2) = 1
Step 3.
Δ^{*} = max[0.048×0, 0.054×0.4] = 0.0216,   x_{2}^{*} = 2
Step 4.
x_{1}^{*} = ψ_{2}(x_{2}^{*}) = ψ_{2}(2) = 1
X^{*}²={1,2}
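
Both worked examples can be reproduced with a short Viterbi sketch. The function and variable names are hypothetical; the model parameters are the ones given in the question, with observation symbols coded as G=0, B=1, R=2 and states reported 1-based to match the notation above.

```python
def viterbi(pi, A, eta, B, obs):
    """Return (delta, best_likelihood, best_path) for a discrete HMM.

    eta holds the exit probabilities used at termination.
    """
    n = len(pi)
    delta = [[pi[i] * B[i][obs[0]] for i in range(n)]]   # initialisation
    psi = [[0] * n]
    for o in obs[1:]:                                    # recursion
        prev = delta[-1]
        row, back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            row.append(prev[best_i] * A[best_i][j] * B[j][o])
            back.append(best_i + 1)                      # 1-based predecessor
        delta.append(row)
        psi.append(back)
    # Termination: include the exit probability of the final state.
    last = max(range(n), key=lambda i: delta[-1][i] * eta[i])
    best = delta[-1][last] * eta[last]
    # Traceback through the stored predecessors.
    path = [last + 1]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1] - 1])
    path.reverse()
    return delta, best, path

G, B_, R = 0, 1, 2
pi = [1.0, 0.0]
A = [[0.8, 0.2], [0.0, 0.6]]
eta = [0.0, 0.4]
B = [[0.5, 0.2, 0.3], [0.0, 0.9, 0.1]]

for obs in ([G, B_, B_], [R, B_]):
    delta, best, path = viterbi(pi, A, eta, B, obs)
    print(round(best, 5), path)
```

The two printed lines correspond to Δ^{*} and X^{*} for the first and second observation sequences respectively.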

 What is the difference between the cumulative likelihoods
α_{t}(i) computed in the forward procedure, and those
δ_{t}(i) computed in the Viterbi algorithm?
The probability
α_{t}(i)=P(o_{1}^{t},
x_{t}=i|λ) is based on all possible paths leading to the current
node; whereas
δ_{t}(i)=P(o_{1}^{t},
x_{1}^{t-1}, x_{t}=i|λ) is based only on the best
path up to and including the current state i.
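To make the contrast concrete, here is a sketch of the forward procedure for the two-state model of the previous question, with max replaced by a sum over predecessors. Names are hypothetical and symbols are coded as G=0, B=1, R=2.

```python
def forward_likelihood(pi, A, eta, B, obs):
    """Total likelihood P(O|model), summing over all state sequences."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        # Sum over every predecessor instead of keeping only the best one.
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha[i] * eta[i] for i in range(n))

pi = [1.0, 0.0]
A = [[0.8, 0.2], [0.0, 0.6]]
eta = [0.0, 0.4]
B = [[0.5, 0.2, 0.3], [0.0, 0.9, 0.1]]

print(round(forward_likelihood(pi, A, eta, B, [0, 1, 1]), 5))  # 0.0252
print(round(forward_likelihood(pi, A, eta, B, [2, 1]), 5))     # 0.0216
```

For {G,B,B} the total likelihood exceeds the best-path value of 0.01944 because two paths reach the exit; for {R,B} only one path does, so the two quantities coincide.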
 Floating-point variables with double precision (i.e.,
8 bytes) can store values down to 1e-308, typical state-transition
probs are in the order of 0.1, and the multidimensional output probability
would be around 0.01 for a good match.

Given these approximations and assuming no rescaling of the probabilities,
state at what stage would it become impossible to compare competing hypotheses
(i.e., different paths through the trellis)?
In other words, after how many observations would you expect the likelihoods to
suffer from numerical underflow?
Each observation scales the likelihood by about 0.1×0.01 = 1e-3, so we need to find the value of T for which (1e-3)^{T}=1e-308.
T=log(1e-308)/log(1e-3) = 103, or ∼100 observations.
 With an observation frame interval of 10 ms, roughly how long would
this take?
1 second.
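The estimate can be checked directly, and the effect seen by comparing a raw product with a running log-likelihood. A sketch under the question's approximations (the per-frame factor of 1e-3 is the assumed value from above; the function name is hypothetical). Python floats are double precision, and gradual underflow lets the raw product survive slightly past the estimate before it collapses to exactly zero.

```python
import math

P_FRAME = 1e-3   # assumed per-frame factor: transition 0.1 x output 0.01

# Number of frames before the product falls below 1e-308.
t_underflow = math.log10(1e-308) / math.log10(P_FRAME)
print(math.ceil(t_underflow))   # 103

def accumulate(n_frames):
    """Return (raw product, log10 likelihood) after n_frames observations."""
    likelihood, log_likelihood = 1.0, 0.0
    for _ in range(n_frames):
        likelihood *= P_FRAME
        log_likelihood += math.log10(P_FRAME)
    return likelihood, log_likelihood

raw, logged = accumulate(120)
print(raw, round(logged))       # the raw product is 0.0, the log value is -360
```

Once the raw product reaches zero, all competing paths score the same, whereas the log-domain scores remain comparable.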
 Instead of storing the likelihoods directly as in the previous
question, we choose to store them as negative log probabilities using a 16-bit
unsigned integer (quantising each decade into 32 levels).
 How many decades (factors of ten) can we represent with this data
type?
65536/32 = 2048 decades.
 How many seconds of observations (at 10 ms) could we now process
before suffering from underflow?
2048/(3×100) = 6.83 s, or ∼7 seconds.
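The same arithmetic as a short sketch, assuming as before that each frame costs about three decades (a factor of 1e-3). The names are hypothetical.

```python
import math

LEVELS_PER_DECADE = 32
MAX_CODE = 2 ** 16 - 1            # largest 16-bit unsigned value
DECADES_PER_FRAME = 3             # assumed: each frame scales by 1e-3
FRAME_S = 0.010

def quantise(prob):
    """Negative log10 probability as an integer code, 32 levels per decade."""
    return round(-math.log10(prob) * LEVELS_PER_DECADE)

decades = 2 ** 16 / LEVELS_PER_DECADE
cost_per_frame = quantise(1e-3)                 # codes consumed per frame
max_frames = MAX_CODE // cost_per_frame         # frames that still fit
print(decades, cost_per_frame, max_frames, round(max_frames * FRAME_S, 2))
```

Counting whole frames against the largest representable code gives 682 frames, i.e., just under 7 seconds, in agreement with the estimate above.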
Week 6
[week 6 solutions]
Week 8
[homework solutions]
