An integrated multiple-level statistical model for speech pattern processing

University of Birmingham


    Martin Russell
    Philip Jackson
    Boon-Hooi Lo
    Nick Wilkinson
    Michael Wong




[Image: Balthasar de Beaujoyeulx, Balet Comique de la Royne, Paris, 1582. Hirsch III.629. © 2000 British Library, reproduced with permission.]

Almost all successful automatic speech recognition (ASR) systems use Hidden Markov Models (HMMs) to model the acoustic realisations of words or phonemes. Typically this approach requires a large set of context-sensitive phoneme-level models trained on a substantial corpus of speech data. Impressive results have been achieved using this method. For example, the best system in the November 1994 evaluation co-ordinated by DARPA scored a word error rate of 7.2% on read speech from North American Business News texts using a 65,000 word vocabulary (Young 1995). The success achieved in the DARPA programme has stimulated significant commercial activity, and a number of systems based on this technology are now on the market, most notably from Dragon Systems and IBM.

Despite these impressive achievements, we still have no real understanding of how to incorporate speech knowledge as a computationally useful constraint in a practical model for speech recognition. We make limited use of speech knowledge, for example in context-sensitive phone-level modelling, but this has not driven significant fundamental improvements to the underlying framework for speech modelling, with the consequence that as recording conditions and speaking style become less controlled, performance drops dramatically (Weintraub et al. 1996). For example, state-of-the-art performance on the Switchboard corpus of spontaneous telephone conversational speech is a word error rate of around 40%. Mainstream speech recognition research attempts to raise performance on these more demanding tasks through further incremental improvements to the conventional HMM formalism; however, there is growing opinion in the research community that more fundamental advances are necessary (Bourlard 1995; Russell 1997).

HMMs owe their success to the combination of a broadly appropriate formalism for modelling temporal and short-term spectral variability in time-varying patterns with powerful formal mathematical methods for data-driven model parameter optimisation and for classification. However, in order to achieve mathematical tractability, HMMs make assumptions which are at variance with the true properties of speech patterns and which render them incapable of exploiting important structure due to constraints in the speech production process. These include the assumptions that the acoustic vectors which represent a spoken utterance are independent in time and piecewise stationary, and that deviations from this piecewise-stationary structure are due to random variation.
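The independence and piecewise-stationarity assumptions can be seen directly in the HMM likelihood computation: given the state sequence, each acoustic vector is scored against a single fixed state distribution, independently of its neighbours. The following is a minimal illustrative sketch (all model parameters and data are invented for illustration, not taken from any system described here):

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def hmm_log_likelihood(obs, log_pi, log_A, means, variances):
    """Forward algorithm: log p(obs) for a Gaussian-output HMM.

    The per-frame term log_b[t, j] depends only on obs[t] and state j:
    this is exactly the conditional-independence assumption discussed
    in the text, and the fixed per-state mean is the piecewise-stationary
    assumption.
    """
    T, S = len(obs), len(log_pi)
    log_b = np.array([[log_gauss(obs[t], means[j], variances[j])
                       for j in range(S)] for t in range(T)])
    log_alpha = log_pi + log_b[0]
    for t in range(1, T):
        # log-sum-exp over predecessor states i for each successor j
        m = log_alpha[:, None] + log_A
        log_alpha = np.log(np.exp(m - m.max(0)).sum(0)) + m.max(0) + log_b[t]
    return np.log(np.exp(log_alpha - log_alpha.max()).sum()) + log_alpha.max()

# Toy 2-state model over 2-dimensional "acoustic vectors"
rng = np.random.default_rng(0)
obs = rng.normal(size=(10, 2))
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.2, 0.8]])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.ones((2, 2))
print(hmm_log_likelihood(obs, log_pi, log_A, means, variances))
```

Note that nothing in the computation couples obs[t] to obs[t-1] once the state is known, which is why systematic frame-to-frame continuity in real speech can only be treated as random variation.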

In recent years, a number of more fundamental extensions of the HMM formalism have been proposed which overcome some of these limitations. Many of these fall into the class of segment models, in which HMM states correspond to sequences of acoustic vectors (or segments), rather than individual vectors (Ostendorf et al. 1996; Holmes and Russell 1999). In this way it is possible to overcome the HMM independence assumption and capture the underlying continuity in speech patterns. Relevant research has been conducted in the USA (Ostendorf et al. 1996), Canada (Deng 1997) and Israel (Goldberger and Burshtein 1998). In the UK there has been relevant work at Cambridge University (Gales and Young 1993) and Imperial College London (Wiewiorka and Brookes 1996). There has also been significant activity at the DERA Speech Research Unit, where an approach to segment modelling based on trajectories in the acoustic vector space has been developed (Holmes and Russell 1999). In addition to proposing new speech modelling formalisms, an important goal of this research has been the extension of conventional HMM training and recognition algorithms, and experimental evaluation.
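The trajectory-based segment models mentioned above can be illustrated with a simple sketch in the spirit of the linear-trajectory segmental HMM (Russell and Holmes 1997): a state accounts for a whole segment of frames whose means follow a straight line through the acoustic space, rather than a single fixed mean. The function and parameter names here are our own, and the model is deliberately simplified:

```python
import numpy as np

def segment_log_likelihood(segment, midpoint, slope, var):
    """Log p(segment | trajectory): frames deviate independently from a
    linear trajectory centred on the segment midpoint."""
    T = len(segment)
    times = np.arange(T) - (T - 1) / 2.0          # centred time index
    traj = midpoint + times[:, None] * slope      # (T, dim) trajectory
    resid = segment - traj
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid ** 2 / var)

def fit_trajectory(segment):
    """Closed-form ML estimates of midpoint and slope: a linear regression
    of each acoustic dimension against the centred time index."""
    T = len(segment)
    times = np.arange(T) - (T - 1) / 2.0
    midpoint = segment.mean(axis=0)
    slope = (times[:, None] * (segment - midpoint)).sum(0) / (times ** 2).sum()
    return midpoint, slope

# A rising "formant-like" segment is fitted much better by a trajectory
# than by the constant-mean (slope = 0) model of a conventional HMM state.
seg = np.linspace([0.0], [1.0], 8) \
      + 0.01 * np.random.default_rng(1).normal(size=(8, 1))
mid, slope = fit_trajectory(seg)
var = np.ones(1)
print(segment_log_likelihood(seg, mid, slope, var))        # trajectory model
print(segment_log_likelihood(seg, mid, np.zeros(1), var))  # static-state model
```

The gap between the two scores is exactly the structure a conventional HMM state must absorb into its variance as apparently random variation.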

In almost all existing research, segment modelling techniques have been applied in a spectrum-based domain, even though the dynamics of the principal articulators correlate more closely with motion across, rather than within, frequency bands. For this reason, and because of the potential to apply articulatory constraints, it is desirable to apply these techniques in a domain more closely related to the underlying articulation. However, simple replacement of spectral with articulatory-based features is unlikely to be successful, since extraction of such features is a classification problem in its own right and notoriously unreliable. In addition, the principle of delayed decision making, which is one of the cornerstones of the success of HMMs, dictates that any intermediate, articulatory-based description of an utterance should emerge as a consequence of, rather than a precursor to, the recognition process. This suggests an alternative approach in which an articulatory-based representation is used as an intermediate layer between the state and surface-acoustic level descriptions. In such an approach, basic acoustic pattern-matching would still take place at the surface-acoustic level, but would be subject to static and dynamic constraints from the intermediate level. The relationship between the state-level and intermediate-level representations could be described using new or existing trajectory-based dynamic segment modelling techniques, but an additional layer of structure is required to describe the correspondence between the intermediate and surface-acoustic levels. Both of these levels need to be integrated into a single mathematical formalism, and techniques for model parameter optimisation and speech pattern classification need to be developed.
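The two-level structure can be sketched schematically as follows. This is not the proposed formalism itself, only an illustration of its shape: each state defines a trajectory in a low-dimensional intermediate space, a mapping F (here an arbitrary fixed linear map, purely for illustration) carries that trajectory into the surface-acoustic space, and pattern matching takes place at the surface level:

```python
import numpy as np

INTERMEDIATE_DIM = 3    # e.g. three formant-like variables
ACOUSTIC_DIM = 12       # e.g. a short spectral feature vector

rng = np.random.default_rng(2)
# Illustrative intermediate-to-surface mapping (random here; in a real
# model this layer of structure would itself be learned from data).
F = rng.normal(size=(ACOUSTIC_DIM, INTERMEDIATE_DIM))

def intermediate_trajectory(start, end, T):
    """State-level description: linear motion in the intermediate space."""
    return np.linspace(start, end, T)              # (T, INTERMEDIATE_DIM)

def surface_log_likelihood(acoustics, start, end, var):
    """Score surface frames against the mapped intermediate trajectory."""
    traj = intermediate_trajectory(start, end, len(acoustics)) @ F.T
    resid = acoustics - traj
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid ** 2 / var)

# Surface frames generated from one intermediate trajectory score higher
# against their own state than against a state whose intermediate
# trajectory moves in the opposite direction.
start_a, end_a = np.zeros(3), np.ones(3)
start_b, end_b = np.ones(3), np.zeros(3)
frames = intermediate_trajectory(start_a, end_a, 10) @ F.T \
         + 0.05 * rng.normal(size=(10, ACOUSTIC_DIM))
var = np.ones(ACOUSTIC_DIM)
print(surface_log_likelihood(frames, start_a, end_a, var))
print(surface_log_likelihood(frames, start_b, end_b, var))
```

The key point of the sketch is that no intermediate description is ever extracted from the signal: it is hypothesised per state and scored at the surface level, consistent with the delayed-decision principle above.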

Although the intermediate level representation has been referred to as `articulatory' there is no obvious need for this interpretation to be explicit. The basic requirement is for a low dimensional intermediate representation which is able to capture major speaker characteristics and has topological properties which are appropriate for modelling speech statics and dynamics. This could be a formant-based representation, reflecting resonances of the vocal tract, or a more abstract representation derived automatically from data (Richards and Bridle 1999).

If successful, this research will help to overcome the limitations of conventional approaches to automatic speech recognition by providing a framework which is both mathematically rigorous and able to accommodate constraints derived from the underlying statics and dynamics of the speech production process. In principle, this should lead to an ability to model phenomena such as co-articulation and articulatory effort and hence to improved speech recognition performance on natural, conversational speech. A model of the structure of speech patterns which relies less on assumptions of random variation should also lead to improved recognition of speech in noise. Moreover, the `distillation' of speaker characteristics into a low-dimensional intermediate representation which is either explicitly or implicitly based on articulatory structure would provide a basis for fast speaker-adaptation. Success would also constitute a significant step towards the development of a unified framework for speech pattern processing which could support both recognition and synthesis. For example, if the intermediate representation were formant-based, then the mapping from the intermediate level to the surface level could be interpreted as a data-driven alternative to conventional formant synthesis-by-rule. This in turn would have important implications for very low bit rate recognition-synthesis speech coders (Holmes 1998), for low-bit-rate communications and for very economical speech storage.



H Bourlard (1995), "Towards increasing speech recognition error rates", Keynote Paper, Proc. Eurospeech'95, Madrid.

H Bourlard and S Dupont (1996), "A new ASR approach based on independent processing and recombination of partial frequency bands", Proc. ICSLP'96.

L Deng (1997), "A dynamic, feature-based approach to speech modeling and recognition", Invited Paper, Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, USA, 107-114.

M J F Gales and S J Young (1993), "Segmental hidden Markov models", Proc. Eurospeech'93, 1579-1582.

J Goldberger and D Burshtein (1998), "Scaled random trajectory segment models", Computer Speech and Language, 12, 1, 51-73.

J N Holmes, W J Holmes and P N Garner (1997), "Using formant frequencies in speech recognition", Proc. Eurospeech'97, 2083-2086.

W J Holmes (1997), "Modelling segmental variability for automatic speech recognition", PhD Thesis, University of London.

W J Holmes and M J Russell (1999), "Probabilistic trajectory segmental HMMs", Computer Speech and Language, 13, 1, 3-38.

D Iskra and W Edmunson (1998), "Feature-based approach to speech recognition", Proc. IOA, 20, 6, 83-89.

L F Lamel, R H Kasel and S Seneff (1986), "Speech database development : design and analysis of the acoustic-phonetic corpus", Proc. DARPA Speech Recognition Workshop, 100-109.

L A Liporace (1982), "Maximum likelihood estimation for multivariate observations of Markov sources", IEEE Transactions on Information Theory, 28, 729-734.

M Ostendorf, V Digalakis and O A Kimball (1996), "From HMMs to segment models: A unified view of stochastic modelling for speech recognition", IEEE Transactions on Speech and Audio Processing, 4, 5, 360-378.

H B Richards and J S Bridle (1999), "The HDM: A segmental hidden dynamic model of coarticulation", Proc. IEEE ICASSP'99.

M J Russell and R K Moore (1985), "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition", Proc. IEEE ICASSP'85.

M J Russell, K M Ponting, S M Peeling, S R Browning, J S Bridle and R K Moore (1990), "The ARM continuous speech recognition system", Proc. IEEE ICASSP'90.

M J Russell (1992), "A segmental statistical model for speech pattern processing", Proc Institute of Acoustics, Vol 14: Pt 6, 503-510.

M Russell (1997), "Progress towards speech models that model speech", Invited Paper, Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, USA, 115-123.

M J Russell and W J Holmes (1997), "Linear trajectory segmental HMMs", IEEE Signal Processing Letters, 4, 72-74.

M J Tomlinson, M J Russell, R K Moore, A P Buckland and M A Fawley (1997), "Modelling asynchrony in speech using elementary single-signal decomposition", Proc. IEEE ICASSP'97.

M Weintraub, K Taussig, K Hunicke-Smith, A Snodgrass (1996), "Effect of Speaking Style on LVCSR Performance", Proc. Addendum ICSLP'96, 16-19.

A Wiewiorka and D M Brookes (1996), "Exponential interpolation of states in a hidden Markov model", Proc. IOA, 18, 9, 201-208.

S Young, J Odell, D Ollason, V Valtchev and P Woodland (1997), "The HTK Book, Version 2.1", Entropic Cambridge Research Laboratory.

S Young (1995), "Large vocabulary continuous speech recognition: a review", Proc. IEEE Automatic Speech Recognition Workshop, Snowbird, Utah, 3-28.


© 2002-4, maintained by Philip Jackson, last updated on 5 February 2004.