Week 1
Speech recognition:
 Give a definition of automatic speech recognition that distinguishes it from other speech technologies.
 How can speech technologies improve access to computer-based systems
for people with disabilities?
Speech communication:
 What acronym denotes the convention for transcribing the sounds of
the world's languages?
 What is the difference between phones and phonemes?
 What three attributes of natural speech contribute most to the
nonlinguistic aspects known as prosody?
Phonetics:
 How can a basic understanding of phonetics facilitate
the study of speech signals?
 What is a diphthong? Illustrate your answer with an example.
 What class of sounds includes /m/, /n/ and /ŋ/ (or /m,n,N/ in SAMPA)?
 What characteristics of the acoustic signal are most useful for
discriminating vowels?
 Give three environmental factors that can affect the way speech is
produced.
 What are the three places of articulation for English plosive
consonants (a.k.a. stops)?
 What is the main difference between the way that the sounds
/t/ and /s/ are produced?
 What name is given to the effect in fluent speech where, for
example, the phrase "isn't it" is pronounced as if it were "in'it"?
[solutions]
Week 2
Dynamic Time Warping:
 Write a pseudocode description of the DTW algorithm using the transitions
shown in Fig.1 (left).
Apply a distortion penalty for the horizontal (H) and steepest (S) transitions,
d_{H} = d_{S} = d_{μ}/4,
where d_{μ} denotes the mean distance found across the training data.
 Modify your pseudocode to disallow two consecutive horizontal
transitions, as shown in Fig.1 (right).
 How can silence and wildcard templates be used during
enrollment to help reduce endpoint detection errors?

Fig.1. Permissible DTW transitions.

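As a starting point for the pseudocode, the core DTW recursion can be sketched in Python. Since Fig.1 is not reproduced here, the transition set is an assumption: a horizontal (H) step, a diagonal step, and a "steepest" (S) step that skips one frame, with H and S carrying the d_{μ}/4 penalty described above.

```python
# Minimal DTW sketch (illustrative only; the exact transitions of Fig.1 are
# assumed to be horizontal, diagonal and steepest as described in the lead-in).
import math

def dtw(template, utterance, dist=lambda a, b: abs(a - b)):
    """Return the cumulative DTW distance between two 1-D sequences."""
    N, M = len(template), len(utterance)
    # local distance matrix and its mean, d_mu, over all cells
    d = [[dist(template[i], utterance[j]) for j in range(M)] for i in range(N)]
    d_mu = sum(sum(row) for row in d) / (N * M)
    pen = d_mu / 4                      # distortion penalty for H and S
    D = [[math.inf] * M for _ in range(N)]
    D[0][0] = d[0][0]                   # paths must start at the origin
    for i in range(1, N):
        for j in range(M):
            best = D[i - 1][j] + pen                     # horizontal (H)
            if j >= 1:
                best = min(best, D[i - 1][j - 1])        # diagonal
            if j >= 2:
                best = min(best, D[i - 1][j - 2] + pen)  # steepest (S)
            D[i][j] = best + d[i][j]
    return D[N - 1][M - 1]
```

For the second part of the question, disallowing two consecutive H transitions would require remembering how each cell was reached, for example by keeping two cumulative-distance matrices (one for cells entered horizontally, one for the rest).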
Speech production:
 Which human organ produces the quasi-periodic source of voiced sounds, such
as vowels?
 What is fundamental frequency (also called
f_{0}), and how is it produced?
 The vocal tract comprises three main passages. The pharynx and
oral cavity are two. What is the third?
 The velum, larynx and jaw cooperate in the production of speech.
Name two other articulators.
 What is a formant and how is it produced?
Speech analysis:
 What is the name of the organ in the inner ear that is responsible for
converting physical vibrations into a set of nerve responses
(i.e., electrical signals)?
 What is the bandwidth to which the human ear responds (to one significant
figure), and what are the implications?
 If I calculate a DFT directly from a 40 ms section of a speech signal, what
will be the spacing of frequency bins in the spectrum?
 Boxcar (rectangular), Kaiser and Blackman define certain kinds of window. Name
three other popular window functions.
 What would be an appropriate window size for a narrowband spectrogram?
[Hint: male f_{0} ≈ 80–200 Hz, female f_{0} ≈ 150–400 Hz]
 Give an estimate of the SNR for a full-scale 16-bit speech signal
in relation to the quantisation noise.
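Two of the estimates above can be sanity-checked with one-line arithmetic. This is a sketch, not the model answer: bin spacing follows from the window duration alone, and the SNR figure uses the standard 6.02·bits + 1.76 dB rule of thumb for a full-scale sinusoid.

```python
# DFT bin spacing depends only on the analysis-window duration:
window_s = 0.040                  # 40 ms section of speech
bin_spacing_hz = 1.0 / window_s   # = 1 / duration = 25 Hz

# Quantisation SNR rule of thumb for a full-scale sinusoid:
bits = 16
snr_db = 6.02 * bits + 1.76       # roughly 98 dB
```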
[solutions]
Week 3
Markov models:

Given an initial state vector π = [0.25 0.50 0.25], and
state-transition probability matrix
A = [0.25 0.50 0.25; 0.00 0.25 0.50; 0.00 0.00 0.25]:
 draw the model to show the states and permissible transitions;
 calculate the probability of the state sequence X = {1, 2, 2, 3}.
 Using the Markov model presented in the lectures,
if today (Monday) has been rainy, what is the most likely weather
 tomorrow,
 in two days' time (i.e., on Wednesday)?
 Calculate the probabilities of rain-rain-sun, rain-cloud-sun and rain-sun-sun,
assuming π_{rain} = 1.
 Hence, if we are told that it will be sunny on Wednesday for certain and that
it's rainy today, what's the most likely weather tomorrow?
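The state-sequence calculation above follows one pattern: the initial probability times a product of transition probabilities. A minimal sketch using the π and A given at the top of this section (states are 1-based in the question, 0-based in Python):

```python
# Probability of a state sequence under a first-order Markov model:
# P(X) = pi[x_1] * product of A[x_t][x_{t+1}] over consecutive states.
pi = [0.25, 0.50, 0.25]
A = [[0.25, 0.50, 0.25],
     [0.00, 0.25, 0.50],
     [0.00, 0.00, 0.25]]

def sequence_prob(states):
    """states: list of 0-based state indices."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s][s_next]
    return p

# X = {1, 2, 2, 3} in the question's 1-based numbering:
p = sequence_prob([0, 1, 1, 2])   # 0.25 * 0.50 * 0.25 * 0.50
```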
Hidden Markov models:
 Considering the statetransition topologies shown in Figure 2:
 write an expression for the state duration probability
P(τ|λ) in Fig.2(a);
 write an expression for the duration probability for each state
P(τ|x=i,λ)
in Fig.2(b);
 hence derive the distribution of duration probabilities for the entire
model in Fig.2(b);
 how many terms are there in this expression for the model duration
with τ=5?

Fig.2. HMM state transitions: (a) and (b).

Feature extraction 1:
 How can a bank of bandpass filters (each tuned to a different centre
frequency) be used to extract a feature vector that describes the overall
spectral shape of an acoustic signal at any particular time?
 The real cepstrum of a digital signal sampled at 48 kHz is defined
as c_{s}(m) = IDFT( ln|S(k)| ), where S(k) is the signal's discrete
Fourier spectrum, ln denotes the natural logarithm and IDFT is the inverse
discrete Fourier transform.
Considering only real, symmetric elements in the log-magnitude spectrum (i.e.,
the cos terms), draw the shapes of the first four cepstral coefficients
c_{0}, c_{1}, c_{2} and c_{3}, in the
log-magnitude spectral domain.
 What properties of the Mel-frequency cepstrum make it more like human
auditory processing, compared to the real cepstrum?
 In calculating MFCCs, what is the purpose of:
 the log operation;
 mel-frequency binning;
 the Discrete Cosine Transform?
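The cepstrum definition above can be made concrete with a naive, dependency-free sketch (not an efficient implementation): a direct DFT, the log magnitude, and an inverse transform that keeps only the cos terms, as the question suggests. Under this construction c_{0} is the mean of the log-magnitude spectrum, c_{1} one cosine cycle across it, and so on.

```python
# Naive real cepstrum via a direct DFT (illustrative sketch only).
import cmath, math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(signal):
    N = len(signal)
    # ln|S(k)|, with a small floor to avoid log(0)
    log_mag = [math.log(abs(X) + 1e-12) for X in dft(signal)]
    # inverse transform of a real, symmetric sequence: cos terms only
    return [sum(log_mag[k] * math.cos(2 * math.pi * m * k / N)
                for k in range(N)) / N for m in range(N)]

# Example input: a periodic pulse train (period 8 samples)
sig = [1.0 if n % 8 == 0 else 0.0 for n in range(64)]
ceps = real_cepstrum(sig)
```

Because the log-magnitude spectrum of a real signal is symmetric, the resulting cepstrum is symmetric too (ceps[m] equals ceps[N−m]), which is worth checking when sketching the coefficient shapes.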
[solutions]
Week 4
Hidden Markov models:
 Draw a trellis diagram for the HMM in Fig.2(b) and a 5-frame observation
sequence.
 Show all the paths that arrive in the final null node after the fifth
frame.
 How many different paths are there for this model and number of
observations?
 Imagine you are designing an optical character recognition (OCR) system
for converting images of written words into text, based on HMMs.
The observations you're given come in the form of pixellated greyscale bitmaps
of a single line of writing.
Explain in general terms how you would construct the following components of the
system:
 frames of feature vectors to make up the observation sequences;
 the models (each one comprising a state or set of states);
 suitable annotations to be used during training;
 any special models (e.g., for dealing with blotches or blank spaces
within the pictures).
 How do the components of the OCR system compare to those for an ASR system
designed to perform an Isolated Word Recognition task?
[solutions]
Week 5
HMM decoding:
 In the Viterbi algorithm, what is the purpose of the variable
ψ_{t}(i)?
 What is the meaning of Δ^{*}?
 Using the Viterbi algorithm, calculate the path likelihoods
δ_{t}(i), the value of
Δ^{*}, and use the values of
ψ_{t}(i) to extract the best path X^{*}:
 for observations O¹={G,B,B} (worked example from lecture)
 for observations O²={R,B}
You may assume the following model parameters:
π=[1 0], A=[0.8 0.2; 0.0 0.6],
η=[0 0.4]^{T}, and B=[0.5 0.2 0.3; 0.0 0.9 0.1] where the
columns of the B matrix correspond to green (G), blue (B) and red (R) events,
respectively.
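A sketch of the Viterbi recursion with these parameters may help check the hand calculation (this is not the worked lecture solution; note the exit vector η is applied only at termination, Δ* = max_i δ_T(i) η_i):

```python
# Viterbi sketch for the 2-state model given above; columns of B are G, B, R.
pi  = [1.0, 0.0]
A   = [[0.8, 0.2],
       [0.0, 0.6]]
eta = [0.0, 0.4]
B   = [[0.5, 0.2, 0.3],   # state 1: P(G), P(B), P(R)
       [0.0, 0.9, 0.1]]   # state 2
SYM = {"G": 0, "B": 1, "R": 2}

def viterbi(obs):
    o = [SYM[s] for s in obs]
    T, N = len(o), len(pi)
    delta = [[0.0] * N for _ in range(T)]   # path likelihoods delta_t(i)
    psi = [[0] * N for _ in range(T)]       # backpointers psi_t(i)
    for i in range(N):
        delta[0][i] = pi[i] * B[i][o[0]]
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][o[t]]
    # terminate through the exit probabilities eta
    last = max(range(N), key=lambda i: delta[T - 1][i] * eta[i])
    Delta_star = delta[T - 1][last] * eta[last]
    path = [last]
    for t in range(T - 1, 0, -1):           # backtrack via psi
        path.append(psi[t][path[-1]])
    return Delta_star, [s + 1 for s in reversed(path)]   # 1-based states
```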
 What is the difference between the cumulative likelihoods
α_{t}(i) computed in the forward procedure, and those
δ_{t}(i) computed in the Viterbi algorithm?
 Floating-point variables with double precision (i.e.,
8 bytes) can store values down to about 1e-308, typical state-transition
probabilities are in the order of 0.1, and the multidimensional output
probability would be around 0.01 for a good match.

Given these approximations and assuming no rescaling of the probabilities,
state at what stage would it become impossible to compare competing hypotheses
(i.e., different paths through the trellis)?
In other words, after how many observations would you expect the likelihoods to
suffer from numerical underflow?
 With an observation frame rate of 10 ms, roughly how long would
this take?
 Instead of storing the likelihoods directly as in the previous
question, we choose to store them as negative log probabilities using a 16-bit
unsigned integer (quantising each decade into 32 levels).
 How many decades (factors of ten) can we represent with this data
type?
 How many seconds of observations (at 10 ms) could we now process
before suffering from underflow?
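The underflow arithmetic in the last two questions reduces to a few lines. This sketch assumes each observation multiplies the likelihood by roughly 0.1 × 0.01, i.e. costs about three decades:

```python
# Back-of-envelope underflow arithmetic (a sketch, not the model answers).
import math

per_frame = 0.1 * 0.01            # transition prob * output prob per frame
# Direct storage in a double underflows near 1e-308:
frames_direct = math.floor(math.log10(1e-308) / math.log10(per_frame))
seconds_direct = frames_direct * 0.010        # 10 ms per observation frame

# Negative log probs in a 16-bit unsigned int, 32 levels per decade:
decades = 2 ** 16 / 32                        # decades representable
frames_log = math.floor(decades / 3)          # ~3 decades used per frame
seconds_log = frames_log * 0.010
```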
[solutions]
Week 6
HMM training:
 Using the observation sequences O¹ and O² from Q.3 (week 5) and
your derived Viterbi alignments X^{*}¹ and X^{*}²,
update the model parameters π, A, η and B according to the Viterbi
re-estimation for multiple files.
 Assuming initial values of a prototype model to be
π=[1 0], A=[0.5 0.5; 0.0 0.5],
η=[0 0.5]^{T}, B=[1/3 1/3 1/3; 1/3 1/3 1/3]:
 Calculate the forward and backward likelihoods, α and β,
for the first set of observations O¹={G,B,B}.
 Calculate the occupation and transition likelihoods, γ and ξ.
 Use the Baum-Welch formulae, which implement the
Expectation-Maximisation procedure, to re-estimate values for π, A, η and B.
 Derive an expression for Baum-Welch re-estimation using multiple
training files for the case of the discrete HMM.
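The forward and backward recursions for the prototype model above can be sketched as follows (the exit vector η initialises β at the final frame, and the total likelihood P(O|λ) can be recovered from α and η, or equivalently from α_t(i)β_t(i) at any t):

```python
# Forward/backward sketch for the 2-state prototype model given above.
pi  = [1.0, 0.0]
A   = [[0.5, 0.5],
       [0.0, 0.5]]
eta = [0.0, 0.5]
B   = [[1/3, 1/3, 1/3],   # uniform output probs over G, B, R
       [1/3, 1/3, 1/3]]
SYM = {"G": 0, "B": 1, "R": 2}

def forward_backward(obs):
    o = [SYM[s] for s in obs]
    T, N = len(o), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):                        # forward initialisation
        alpha[0][i] = pi[i] * B[i][o[0]]
    for t in range(1, T):                     # forward recursion
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j]
                              for i in range(N)) * B[j][o[t]]
    for i in range(N):                        # backward init via exit probs
        beta[T - 1][i] = eta[i]
    for t in range(T - 2, -1, -1):            # backward recursion
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][o[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    likelihood = sum(alpha[T - 1][i] * eta[i] for i in range(N))
    return alpha, beta, likelihood
```

From these, the occupation and transition likelihoods γ and ξ in the next part follow by normalising α_t(i)β_t(i) and α_t(i)a_{ij}b_j(o_{t+1})β_{t+1}(j) by the total likelihood.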
Continuous HMMs:

Based on a univariate Gaussian pdf, b_{i}(o_{t}),
derive an expression for the negative log probability density (neg-log-prob,
-ln b_{i}) in the form of three terms: a constant, a term dependent
only on the variance, and a term dependent on the observations.

 Derive the maximum likelihood (ML) estimate of the mean μ from a known
set of scalar observations, assuming a Gaussian pdf.
[Hint: write the likelihood function as a log probability, and then
differentiate.]
 Derive the ML estimate of the variance Σ in the same way.
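The closed forms these derivations arrive at (the ML mean is the sample mean; the ML variance is the biased mean squared deviation) are easy to check numerically on toy data:

```python
# ML estimates for a scalar Gaussian on a small set of observations.
obs = [1.0, 2.0, 4.0, 5.0]
mu_ml = sum(obs) / len(obs)                              # sample mean
var_ml = sum((o - mu_ml) ** 2 for o in obs) / len(obs)   # biased variance
```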
 A 2-dimensional feature vector is made up of two independent observations
which have standard deviations of 2 and 3 units respectively.
 What are the variances of each of the two dimensions of the
observation vector, o_{t}=[o_{1}(t)
o_{2}(t)]^{T}?
 Hence, write down the 2×2 covariance matrix Σ for
o_{t}.
 Evaluate the determinant of this matrix, |Σ|.
 Sketch the following pdfs by drawing contours of equal probability
density:
 μ=[0 0]^{T}, Σ=[1 0; 0 ¼];
 μ=[3 2]^{T}, Σ=[4 0; 0 9];
 μ=[2 2]^{T}, Σ=[2 1; 1 2];
 μ=[2 2]^{T}, Σ=[2 -1; -1 2].
 Sketch the pdf for Gaussian mixtures with the following parameters:
 Univariate: c_{1}=1/3, μ_{1}=0,
Σ_{1}=1 and c_{2}=2/3, μ_{2}=3,
Σ_{2}=1;
 Bivariate: c_{1}=½, μ_{1}=[4; 4],
Σ_{1}=[2 1; 1 2] and c_{2}=½,
μ_{2}=[-4; -4], Σ_{2}=[2 1; 1 2].
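To guide the univariate sketch in part (a), the mixture density can be evaluated directly (a sketch for checking the relative heights of the two modes, not a plotting solution):

```python
# Univariate Gaussian mixture from part (a): c1=1/3 at mu=0, c2=2/3 at mu=3,
# both with unit variance.
import math

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture(x):
    return (1/3) * gauss(x, 0.0, 1.0) + (2/3) * gauss(x, 3.0, 1.0)

# The heavier component makes the mode near x = 3 roughly twice as tall
# as the one near x = 0.
```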
 Derive expressions for training Gaussian-mixture output pdfs using multiple
files:
 for Viterbi re-estimation;
 for Baum-Welch re-estimation.
[solutions]
