BALTHASAR: An integrated multiple-level statistical model for speech pattern processing

People
Martin Russell
Philip Jackson
Boon-Hooi Lo
Nick Wilkinson
Michael Wong

Balthasar de Beaujoyeulx, Balet Comique de la Royne. Paris, 1582. Hirsch III.629.
© 2000 British Library (with permission, any further reproduction prohibited).

EPSRC grant reference: GR/M87146

The name BALTHASAR stands for the Birmingham Articulatory-layered Linear-Trajectory HMM-based Automatic Speech Recognizer. It is a three-year, EPSRC-funded research project that started in February 2000 and has produced a number of conference and journal publications, which are listed below.

Overview

The goal is to develop a rigorous theory of multiple-level statistical modelling of acoustic speech patterns. Within this framework, the relationship between the symbolic (state-level) and surface-acoustic representations of speech is regulated by an intermediate representation which is able to capture the inherent constraints of the speech production process. The intermediate representation is referred to as the articulatory, or pseudo-articulatory, layer. Ideally, this intermediate representation will also provide a low-dimensional characterisation of speaker properties, which offers the potential for rapid speaker adaptation. This characterisation may be explicit, as in the case of an actual articulatory model, or implicit. The research will involve the derivation of a suitable mathematical framework, extension of the relevant training and recognition algorithms, and evaluation of the resulting models through off-line experimentation on an international-standard speech corpus.

An important objective for the project will be to demonstrate that it is possible to derive extended versions of the Hidden Markov Model (HMM) Viterbi decoding algorithm (for recognition or training) and Baum-Welch algorithm (for training) which are valid for integrated, multiple-level models. Of these, it is anticipated that the training algorithm will present the most difficulty. By analogy with the approach taken for trajectory-based segmental HMMs (Russell 1992; Russell and Holmes 1997), we shall try to extend Liporace's (1982) derivation of the conventional Baum-Welch re-estimation method. An important factor will be the nature of the mapping between the intermediate and surface-acoustic layers of the model.
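For reference, the conventional single-level Viterbi recursion that the project aims to extend can be sketched as below. This is a minimal illustration of the standard algorithm in the log domain, not the extended multiple-level version; the function name `viterbi` and the layout of its arguments are assumptions made for this example.

```python
def viterbi(log_init, log_trans, log_emit):
    """Return (best_state_path, best_log_score) for a conventional HMM.

    log_init[i]     : log P(first state = i)
    log_trans[i][j] : log P(next state = j | current state = i)
    log_emit[t][j]  : log p(observation at frame t | state j)
    Impossible events can be encoded as float("-inf").
    """
    n_states = len(log_init)
    # delta[j] is the best log score of any partial path ending in state j.
    delta = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_delta, pointers = [], []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states),
                         key=lambda i: delta[i] + log_trans[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j]
                             + log_emit[t][j])
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]
```

In a multiple-level model the recursion itself would keep this shape; what changes is how the emission terms in `log_emit` are computed, since they would then depend on the intermediate layer and its mapping to the acoustic layer.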

For simplicity, the project will focus initially on integrated, multiple-level versions of conventional HMMs, in which states correspond to individual points (rather than sequences of points) in the intermediate representation. This stage is necessary to understand difficulties associated with the inclusion of an intermediate layer in a relatively simple model, as problems will be exacerbated in a segmental framework.
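As a rough illustration of this simplest case, the sketch below scores one acoustic frame against a state that is represented by a single point in the intermediate layer. It assumes, purely for the example, that the intermediate-to-acoustic mapping is linear (y = Wx + b) and that the acoustic output is Gaussian with diagonal covariance; the function names and the parameters `W`, `b` and `var` are hypothetical and are not taken from the project's software.

```python
import math

def map_to_acoustic(W, b, x):
    """Apply an assumed linear mapping y = W x + b (W is a list of rows)."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def log_gaussian_diag(y, mean, var):
    """Log density of y under a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (y_k - m_k) ** 2 / v)
               for y_k, m_k, v in zip(y, mean, var))

def state_log_likelihood(y, intermediate_point, W, b, var):
    """Score acoustic frame y for a state defined by one intermediate point.

    The state's point is first mapped into the acoustic space, and the
    frame is then scored against a Gaussian centred on the mapped point.
    """
    acoustic_mean = map_to_acoustic(W, b, intermediate_point)
    return log_gaussian_diag(y, acoustic_mean, var)
```

Values returned by `state_log_likelihood` could serve as the per-state emission scores in a conventional decoder. A non-linear mapping would replace only `map_to_acoustic`.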

New algorithms will be implemented as software tools in the HTK environment (Young et al. 1997), which is the international de facto standard for experimental HMM-based speech recognition. Experimental evaluation will use the TIMIT speech corpus (Lamel et al. 1986), to allow comparison with results from other laboratories and to ensure that the results are relevant from an international perspective.

Related publications

While the core technology remains under development, recognition results from tests with our new system are still preliminary. However, we have been publishing our findings as we go:

MJ Russell, X Zheng, PJB Jackson (2007). "Modelling speech signals using formant frequencies as an intermediate representation". IET Signal Processing, 1 (1): 43-50. [ bib | doi | abstract | preprint ]

MJ Russell, PJB Jackson (2005). "A multiple-level linear/linear segmental HMM with a formant-based intermediate layer". Computer Speech and Language, 19 (2): 205-225. [ bib | doi | abstract | preprint ]

MJ Russell, PJB Jackson (2004). "Regularized re-estimation of stochastic duration models". In J. Acoust. Soc. Am., 115 (5, Pt. 2): 2429 A, New York, New York, USA. [ abstract ]

MJ Russell, PJB Jackson (2003). "The effect of an intermediate articulatory layer on the performance of a segmental HMM". In Proc. Eurospeech 2003, 2737-2740, Geneva. [ abstract | pdf ]

PJB Jackson (2003). "Improvements in phone-classification accuracy from modelling duration". In Proc. Int. Cong. of Phon. Sci., ICPhS 2003, 1349-1352, Barcelona. [ abstract | pdf ]

MJ Russell, PJB Jackson, MLP Wong (2003). "Development of articulatory-based multi-level segmental HMMs for phonetic classification in ASR". In Proc. EURASIP Conf. on Video/Image Proc. & Multimedia Comm., EC-VIP-MC 2003, 2: 655-660, Zagreb, Croatia. [ bib | doi | abstract | preprint ]

PJB Jackson, MJ Russell (2002). "Models of speech dynamics in a segmental-HMM recognizer using intermediate linear representations". In Proc. Int. Conf. on Spoken Lang. Proc., ICSLP 2002, 1253-1256, Denver, Colorado, USA. [ abstract | pdf | ps ]

N Wilkinson, MJ Russell (2002). "Improved phone recognition on TIMIT using formant frequency data and confidence measures". In Proc. Int. Conf. on Spoken Lang. Proc., ICSLP 2002, 2121-2124, Denver, Colorado, USA. [ abstract | doc | pdf | ps ]

PJB Jackson, B-H Lo, MJ Russell (2002). "Models of speech dynamics for ASR, using intermediate linear representations". Presented at NATO Advanced Study Institute on the Dynamics of Speech Production and Perception, Il Ciocco, Italy. [ abstract | ppt ]

PJB Jackson, B-H Lo, MJ Russell (2002). "Data-driven, non-linear, formant-to-acoustic mapping for ASR". Electronics Letters, 38 (13): 667-669. [ bib | doi | abstract | preprint ]

N Wilkinson, MJ Russell (2001). "Progress towards improved speech modelling using asynchronous sub-bands and formant frequencies". In Proc. Inst. Acoust., WISP 2001, 23 (3): 27-36, Stratford-upon-Avon, UK. [ pdf ]

PJB Jackson (2001). "Acoustic cues of voiced and voiceless plosives for determining place of articulation". In Proc. Workshop on Consistent and Reliable Acoustic Cues for sound analysis, CRAC 2001, 19-22, Aalborg, Denmark. [ abstract | pdf | ps ]