An integrated multiple-level statistical model for speech pattern processing

University of Birmingham


    Martin Russell
    Philip Jackson
    Boon-Hooi Lo
    Nick Wilkinson
    Michael Wong




[Image: Balthasar de Beaujoyeulx, Balet Comique de la Royne, Paris, 1582. Hirsch III.629. © 2000 British Library, reproduced with permission.]

Almost all successful automatic speech recognition (ASR) systems use Hidden Markov Models (HMMs) to model the acoustic realisations of words or phonemes. Typically this approach requires a large set of context-sensitive phoneme-level models trained on a substantial corpus of speech data. Impressive results have been achieved using this method. For example, the best system in the November 1994 evaluation co-ordinated by DARPA scored a word error rate of 7.2% on read speech from North American Business News texts using a 65,000 word vocabulary (Young 1995). The success achieved in the DARPA programme has stimulated significant commercial activity, and a number of systems based on this technology are now on the market, most notably from Dragon Systems and IBM.

Despite these impressive achievements, we still have no real understanding of how to incorporate speech knowledge as a computationally useful constraint in a practical model for speech recognition. We make limited use of speech knowledge, for example in context-sensitive phone-level modelling, but this has not driven significant fundamental improvements to the underlying framework for speech modelling, with the consequence that as recording conditions and speaking style become less controlled, performance drops dramatically (Weintraub et al. 1996). For example, state-of-the-art performance on the Switchboard corpus of spontaneous telephone conversational speech is a word error rate of around 40%. Mainstream speech recognition research attempts to raise performance on these more demanding tasks through further incremental improvements to the conventional HMM formalism; however, there is growing opinion in the research community that more fundamental advances are necessary (Bourlard 1995; Russell 1997).

HMMs owe their success to the combination of a broadly appropriate formalism for modelling temporal and short-term spectral variability in time-varying patterns with powerful formal mathematical methods for data-driven model parameter optimisation and for classification. However, in order to achieve mathematical tractability, HMMs make assumptions which are at variance with the true properties of speech patterns and which render them incapable of exploiting important structure due to constraints in the speech production process. These include the assumptions that the acoustic vectors which represent a spoken utterance are independent in time and piecewise stationary, and that deviations from this piecewise-stationary structure are due to random variation.
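The independence and piecewise-stationarity assumptions can be seen directly in the HMM likelihood computation: given the state sequence, each acoustic vector is scored against a single fixed state distribution, independently of its neighbours. The following is a minimal illustrative sketch (all model parameters and data are invented for illustration, not taken from any system described here):

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def hmm_log_likelihood(obs, log_pi, log_A, means, variances):
    """Forward algorithm: log p(obs) for a Gaussian-output HMM.

    The per-frame term log_b[t, j] depends only on obs[t] and state j:
    this is exactly the conditional-independence assumption discussed
    in the text, and the fixed per-state mean is the piecewise-stationary
    assumption.
    """
    T, S = len(obs), len(log_pi)
    log_b = np.array([[log_gauss(obs[t], means[j], variances[j])
                       for j in range(S)] for t in range(T)])
    log_alpha = log_pi + log_b[0]
    for t in range(1, T):
        # log-sum-exp over predecessor states i for each successor j
        m = log_alpha[:, None] + log_A
        log_alpha = np.log(np.exp(m - m.max(0)).sum(0)) + m.max(0) + log_b[t]
    return np.log(np.exp(log_alpha - log_alpha.max()).sum()) + log_alpha.max()

# Toy 2-state model over 2-dimensional "acoustic vectors"
rng = np.random.default_rng(0)
obs = rng.normal(size=(10, 2))
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.2, 0.8]])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.ones((2, 2))
print(hmm_log_likelihood(obs, log_pi, log_A, means, variances))
```

Note that nothing in the computation couples obs[t] to obs[t-1] once the state is known, which is why systematic frame-to-frame continuity in real speech can only be treated as random variation.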

In recent years, a number of more fundamental extensions of the HMM formalism have been proposed which overcome some of these limitations. Many of these fall into the class of segment models, in which HMM states correspond to sequences of acoustic vectors (or segments), rather than individual vectors (Ostendorf et al. 1996; Holmes and Russell 1999). In this way it is possible to overcome the HMM independence assumption and capture the underlying continuity in speech patterns. Relevant research has been conducted in the USA (Ostendorf et al. 1996), Canada (Deng 1997) and Israel (Goldberger and Burshtein 1998). In the UK there has been relevant work at Cambridge University (Gales and Young 1993) and Imperial College London (Wiewiorka and Brookes 1996). There has also been significant activity at the DERA Speech Research Unit, where an approach to segment modelling based on trajectories in the acoustic vector space has been developed (Holmes and Russell 1999). In addition to proposing new speech modelling formalisms, an important goal of this research has been the extension of conventional HMM training and recognition algorithms, and experimental evaluation.
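The trajectory-based segment models mentioned above can be illustrated with a simple sketch in the spirit of the linear-trajectory segmental HMM (Russell and Holmes 1997): a state accounts for a whole segment of frames whose means follow a straight line through the acoustic space, rather than a single fixed mean. The function and parameter names here are our own, and the model is deliberately simplified:

```python
import numpy as np

def segment_log_likelihood(segment, midpoint, slope, var):
    """Log p(segment | trajectory): frames deviate independently from a
    linear trajectory centred on the segment midpoint."""
    T = len(segment)
    times = np.arange(T) - (T - 1) / 2.0          # centred time index
    traj = midpoint + times[:, None] * slope      # (T, dim) trajectory
    resid = segment - traj
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid ** 2 / var)

def fit_trajectory(segment):
    """Closed-form ML estimates of midpoint and slope: a linear regression
    of each acoustic dimension against the centred time index."""
    T = len(segment)
    times = np.arange(T) - (T - 1) / 2.0
    midpoint = segment.mean(axis=0)
    slope = (times[:, None] * (segment - midpoint)).sum(0) / (times ** 2).sum()
    return midpoint, slope

# A rising "formant-like" segment is fitted much better by a trajectory
# than by the constant-mean (slope = 0) model of a conventional HMM state.
seg = np.linspace([0.0], [1.0], 8) \
      + 0.01 * np.random.default_rng(1).normal(size=(8, 1))
mid, slope = fit_trajectory(seg)
var = np.ones(1)
print(segment_log_likelihood(seg, mid, slope, var))        # trajectory model
print(segment_log_likelihood(seg, mid, np.zeros(1), var))  # static-state model
```

The gap between the two scores is exactly the structure a conventional HMM state must absorb into its variance as apparently random variation.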

In almost all existing research, segment modelling techniques have been applied in a spectrum-based domain, even though the dynamics of the principal articulators correlate more closely with motion across, rather than within, frequency bands. For this reason, and because of the potential to apply articulatory constraints, it is desirable to apply these techniques in a domain more closely related to the underlying articulation. However, simple replacement of spectral with articulatory-based features is unlikely to be successful, since extraction of such features is a classification problem in its own right and notoriously unreliable. In addition, the principle of delayed decision making, which is one of the cornerstones of the success of HMMs, dictates that any intermediate, articulatory-based description of an utterance should emerge as a consequence of, rather than a precursor to, the recognition process. This suggests an alternative approach in which an articulatory-based representation is used as an intermediate layer between the state and surface-acoustic level descriptions. In such an approach, basic acoustic pattern-matching would still take place at the surface-acoustic level, but would be subject to static and dynamic constraints from the intermediate level. The relationship between the state-level and intermediate-level representations could be described using new or existing trajectory-based dynamic segment modelling techniques, but an additional layer of structure is required to describe the correspondence between the intermediate and surface-acoustic levels. Both of these levels need to be integrated into a single mathematical formalism, and techniques for model parameter optimisation and speech pattern classification need to be developed.
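The two-level structure can be sketched schematically as follows. This is not the proposed formalism itself, only an illustration of its shape: each state defines a trajectory in a low-dimensional intermediate space, a mapping F (here an arbitrary fixed linear map, purely for illustration) carries that trajectory into the surface-acoustic space, and pattern matching takes place at the surface level:

```python
import numpy as np

INTERMEDIATE_DIM = 3    # e.g. three formant-like variables
ACOUSTIC_DIM = 12       # e.g. a short spectral feature vector

rng = np.random.default_rng(2)
# Illustrative intermediate-to-surface mapping (random here; in a real
# model this layer of structure would itself be learned from data).
F = rng.normal(size=(ACOUSTIC_DIM, INTERMEDIATE_DIM))

def intermediate_trajectory(start, end, T):
    """State-level description: linear motion in the intermediate space."""
    return np.linspace(start, end, T)              # (T, INTERMEDIATE_DIM)

def surface_log_likelihood(acoustics, start, end, var):
    """Score surface frames against the mapped intermediate trajectory."""
    traj = intermediate_trajectory(start, end, len(acoustics)) @ F.T
    resid = acoustics - traj
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid ** 2 / var)

# Surface frames generated from one intermediate trajectory score higher
# against their own state than against a state whose intermediate
# trajectory moves in the opposite direction.
start_a, end_a = np.zeros(3), np.ones(3)
start_b, end_b = np.ones(3), np.zeros(3)
frames = intermediate_trajectory(start_a, end_a, 10) @ F.T \
         + 0.05 * rng.normal(size=(10, ACOUSTIC_DIM))
var = np.ones(ACOUSTIC_DIM)
print(surface_log_likelihood(frames, start_a, end_a, var))
print(surface_log_likelihood(frames, start_b, end_b, var))
```

The key point of the sketch is that no intermediate description is ever extracted from the signal: it is hypothesised per state and scored at the surface level, consistent with the delayed-decision principle above.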

Although the intermediate level representation has been referred to as `articulatory' there is no obvious need for this interpretation to be explicit. The basic requirement is for a low dimensional intermediate representation which is able to capture major speaker characteristics and has topological properties which are appropriate for modelling speech statics and dynamics. This could be a formant-based representation, reflecting resonances of the vocal tract, or a more abstract representation derived automatically from data (Richards and Bridle 1999).

If successful, this research will help to overcome the limitations of conventional approaches to automatic speech recognition by providing a framework which is both mathematically rigorous and able to accommodate constraints derived from the underlying statics and dynamics of the speech production process. In principle, this should lead to an ability to model phenomena such as co-articulation and articulatory effort and hence to improved speech recognition performance on natural, conversational speech. A model of the structure of speech patterns which relies less on assumptions of random variation should also lead to improved recognition of speech in noise. Moreover, the `distillation' of speaker characteristics into a low-dimensional intermediate representation which is either explicitly or implicitly based on articulatory structure would provide a basis for fast speaker-adaptation. Success would also constitute a significant step towards the development of a unified framework for speech pattern processing which could support both recognition and synthesis. For example, if the intermediate representation were formant-based, then the mapping from the intermediate level to the surface level could be interpreted as a data-driven alternative to conventional formant synthesis-by-rule. This in turn would have important implications for very low bit rate recognition-synthesis speech coders (Holmes 1998), for low-bit-rate communications and for very economical speech storage.



H Bourlard (1995), "Towards increasing speech recognition error rates", Keynote Paper, Proc. Eurospeech'95, Madrid.

H Bourlard and S Dupont (1996), "A new ASR approach based on independent processing and recombination of partial frequency bands", Proc. ICSLP'96.

L Deng (1997), "A dynamic, feature-based approach to speech modeling and recognition", Invited Paper, Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, USA, 107-114.

M J F Gales and S J Young (1993), "Segmental hidden Markov models", Proc. Eurospeech'93, 1579-1582.

J Goldberger and D Burshtein (1998), "Scaled random trajectory segment models", Computer Speech and Language, 12, 1, 51-73.

J N Holmes, W J Holmes and P N Garner (1997), "Using formant frequencies in speech recognition", Proc. Eurospeech'97, 2083-2086.

W J Holmes (1997), "Modelling segmental variability for automatic speech recognition", PhD Thesis, University of London.

W J Holmes and M J Russell (1999), "Probabilistic trajectory segmental HMMs", Computer Speech and Language, 13, 1, 3-38.

D Iskra and W Edmunson (1998), "Feature-based approach to speech recognition", Proc. IOA, 20, 6, 83-89.

L F Lamel, R H Kasel and S Seneff (1986), "Speech database development : design and analysis of the acoustic-phonetic corpus", Proc. DARPA Speech Recognition Workshop, 100-109.

L A Liporace (1982), "Maximum likelihood estimation for multivariate observations of Markov sources", IEEE Transactions on Information Theory, 28, 729-734.

M Ostendorf, V Digalakis and O A Kimball (1996), "From HMMs to segment models: A unified view of stochastic modelling for speech recognition", IEEE Transactions on Speech and Audio Processing, 4, 5, 360-378.

H B Richards and J S Bridle (1999), "The HDM: A segmental hidden dynamic model of coarticulation", Proc. IEEE ICASSP'99.

M J Russell and R K Moore (1985), "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition", Proc. IEEE ICASSP'85.

M J Russell, K M Ponting, S M Peeling, S R Browning, J S Bridle and R K Moore (1990), "The ARM continuous speech recognition system", Proc. IEEE ICASSP'90.

M J Russell (1992), "A segmental statistical model for speech pattern processing", Proc Institute of Acoustics, Vol 14: Pt 6, 503-510.

M Russell (1997), "Progress towards speech models that model speech", Invited Paper, Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, USA, 115-123.

M J Russell and W J Holmes (1997), "Linear trajectory segmental HMMs", IEEE Signal Processing Letters, 4, 72-74.

M J Tomlinson, M J Russell, R K Moore, A P Buckland and M A Fawley (1997), "Modelling asynchrony in speech using elementary single-signal decomposition", Proc. IEEE ICASSP'97.

M Weintraub, K Taussig, K Hunicke-Smith, A Snodgrass (1996), "Effect of Speaking Style on LVCSR Performance", Proc. Addendum ICSLP'96, 16-19.

A Wiewiorka and D M Brookes (1996), "Exponential interpolation of states in a hidden Markov model", Proc. IOA, 18, 9, 201-208.

S Young, J Odell, D Ollason, V Valtchev and P Woodland (1997), "The HTK Book, Version 2.1", Entropic Cambridge Research Laboratory.

S Young (1995), "Large vocabulary continuous speech recognition: a review", Proc. IEEE Automatic Speech Recognition Workshop, Snowbird, Utah, 3-28.


© 2002-4, maintained by Philip Jackson, last updated on 5 February 2004.