Philip Jackson    

Abstracts of my publications

The University of Surrey

 

Listing  

Journal papers  

Conferences  
    ASA 2007  
    OMYSR 2007  
    UMSRS 2007  
    CVMP 2006  
    ASA 2006  
    OMYSR 2005  
    ASA 2004  
    OMYSR 2004  
    OMYSR 2003  
    ASI 2002  
    ICCBDED 2002  
    ASA-EAA 1999  
    ICVPB 1999  

Book chapter  

Doctoral thesis  

FTP site  


Refereed Conference Abstracts

A Barney, PJB Jackson (2007). "Aerodynamically-based parametric description of the noise envelope in voiced fricatives". In J. Acoust. Soc. Am., 121 (5, Pt. 2): 3122 A, Salt Lake City, UT, USA. [ abstract ]

VD Singampalli, PJB Jackson (2007). "A statistical technique for identifying articulatory roles in speech production". In Proc. One-day Meeting for Young Speech Researchers, p. 25 A, London, UK. [ abstract | poster ]

VD Singampalli, PJB Jackson (2007). "Coarticulatory relations in a compact model of articulatory dynamics". In Proc. one-day meeting on Unified Models for Speech Recognition and Synthesis, Birmingham, UK, p. 3 (A). [ abstract | slides ]

Y Shiga, PJB Jackson (2007). "Comparison of pruning strategies for segmental HMMs". In Proc. one-day meeting on Unified Models for Speech Recognition and Synthesis, Birmingham, UK, p. 7 (A). [ abstract | slides ]

N Nadtoka, PJB Jackson, J Edge, A Hilton, J Tena (2006). "Representing dynamics of facial expressions". In IET Conference on Visual Media Production, London, UK. 1 p. (A). [ abstract ]

A Barney, PJB Jackson (2006). "Modulation of frication noise in a dynamic mechanical model of the larynx and vocal tract". In J. Acoust. Soc. Am., 119 (5, Pt. 2): 3301 A, Providence, RI, USA. [ abstract ]

J Pincas, PJB Jackson (2006). "Detection thresholds for amplitude modulation of noise with simultaneous modulating tone". In J. Acoust. Soc. Am., 119 (5, Pt. 2): 3234 A, Providence, RI, USA. [ abstract ]


Pincas, J. and Jackson, P.J.B. (2005a). Amplitude profiles of fricatives described by temporal moments.
In Proceedings of One-day Meeting for Young Speech Researchers, OMYSR 2005, p. 12, London.

Abstract:

As well as the rapid fluctuations in amplitude that make up the 'fine structure' of noise, various degrees of slower loudness change, or envelope fluctuation, are present in fricative sounds. In voiced fricatives, noise is generally amplitude modulated by the voicing component, resulting in a periodic pulsing [Pincas and Jackson 2004, Proc. of From Sound to Sense, MIT, 73-78]. In addition, all fricatives display some build-up and decay of noise power from frication onset to offset. This paper focuses on these latter amplitude changes, which we term amplitude profiles.

Frication build-up and decay for an 8-speaker corpus of intervocalic fricatives were investigated by treating their amplitude profiles as statistical distributions whose properties are fully specified by their first four standard moments: mean, standard deviation, skewness and kurtosis ('peakiness'). This is an adaptation of the spectral moments technique previously used to describe the main features of fricative spectra [Jongman et al. 2000, JASA 108(3):1252-1263].
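
The moment computation can be sketched by treating the normalised amplitude envelope as a probability distribution over time. The following Python fragment is a hypothetical illustration, not the authors' implementation; the envelope extraction and sampling rate `fs` are assumed given:

```python
import numpy as np

def temporal_moments(envelope, fs):
    """First four 'temporal moments' of an amplitude profile, treating
    the non-negative envelope as a distribution over time (sketch)."""
    a = np.asarray(envelope, dtype=float)
    t = np.arange(len(a)) / fs           # time axis in seconds
    p = a / a.sum()                      # normalise to a probability mass
    mean = np.sum(t * p)
    sd = np.sqrt(np.sum((t - mean) ** 2 * p))
    skew = np.sum(((t - mean) / sd) ** 3 * p)   # asymmetry of build-up/decay
    kurt = np.sum(((t - mean) / sd) ** 4 * p)   # 'peakiness' of the profile
    return mean, sd, skew, kurt
```

A symmetric profile yields zero skewness; a sibilant's flat plateau would show up as low kurtosis.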

Analysis of these temporal moments shows that the sibilant/non-sibilant split is consistently manifested in the 'flatness' of profiles, whereas voicing status has more effect on whether build-up is skewed towards the beginning or the end of the fricative. These acoustic results are examined in light of probable articulatory explanations.

The perceptual significance of amplitude profiles is also discussed. It is known, for example, that the temporal acuity of the auditory system is good enough to distinguish even very fast amplitude fluctuations [Viemeister 1990, JASA 88(3):1367-1373], but it is unclear to what extent differences in profiles could function as a linguistic cue or naturalness enhancer.

[ abstract | slides ]


Jackson, P.J.B., Jesus, L.M.T., Shadle, C.H. and Pincas, J. (2004). Measures of voiced frication for automatic classification.
Journal of the Acoustical Society of America, Vol. 115 (5, Pt. 2), p. 2429, New York NY (abstract).

Abstract:

As an approach to understanding the characteristics of the acoustic sources in voiced fricatives, it seems apt to draw on knowledge of vowels and voiceless fricatives, which have been relatively well studied. However, the presence of both phonation and frication in these mixed-source sounds offers the possibility of mutual interaction effects, with variations across place of articulation. This paper examines the acoustic and articulatory consequences of these interactions and explores automatic techniques for finding parametric and statistical descriptions of these phenomena. A reliable and consistent set of such acoustic cues could be used for phonetic classification or speech recognition. Following work on devoicing of European Portuguese voiced fricatives [Jesus & Shadle, In Mamede, et al. (Eds.), pp. 1-8, Berlin: Springer-Verlag, 2003] and the modulating effect of voicing on frication [Jackson & Shadle, JASA, 108(4): 1421-1434, 2000], the present study focuses on three types of information: (i) sequences and durations of acoustic events in VC transitions, (ii) temporal, spectral and modulation measures from the periodic and aperiodic components of the acoustic signal, and (iii) voicing activity derived from simultaneous EGG data. Analyses of interactions observed in British/American English and European Portuguese speech corpora will be compared, and the principal findings discussed.

[ abstract ]


Russell, M.J. and Jackson, P.J.B. (2004). Regularized re-estimation of stochastic duration models.
Journal of the Acoustical Society of America, Vol. 115 (5, Pt. 2), p. 2429, New York NY (abstract).

Abstract:

Recent research has compared the performance of various distributions (uniform, boxcar, exponential, gamma, discrete) for modeling segment (state) durations in hidden semi-Markov models used for phone classification on the TIMIT database. These experiments have shown that a gamma distribution is more appropriate than exponential (which is implicit in first-order Markov models), and achieved a 3% relative reduction in phone-classification errors [Jackson, Proc. ICPhS, 1349-1352, 2003]. The parameters of these duration distributions were estimated once for each model from initial statistics of state occupation (offline), and remained unchanged during subsequent iterations of training. The present work investigates the effect of re-estimating the duration models in training (online) with respect to the phone-classification scores. First, tests were conducted on duration models re-estimated directly from statistics gathered in the previous iteration of training. It was found that the boxcar and gamma models were unstable, while the performance of the other models also tended to degrade. Secondary tests, using a scheme of annealed regularization, demonstrated that the losses could be recouped and a further 1% improvement was obtained. The results from this pilot study imply that similar gains in recognition accuracy deserve investigation, along with further optimization of the duration model re-estimation procedure.
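
The regularized update can be illustrated with a method-of-moments gamma fit interpolated towards the previous parameters, with the interpolation weight playing the role of the annealing schedule. This is a hypothetical sketch of the idea, not the paper's exact procedure; function names and the update rule are invented:

```python
import numpy as np

def fit_gamma_moments(durations):
    """Method-of-moments gamma fit: mean = k*theta, var = k*theta^2."""
    m, v = np.mean(durations), np.var(durations)
    theta = v / m
    return m / theta, theta          # shape k, scale theta

def regularized_update(durations, prior, lam):
    """Interpolate the new ML estimate towards the previous (prior)
    parameters; annealing lam from small to large would gradually
    trust the re-estimated statistics more."""
    k_ml, th_ml = fit_gamma_moments(durations)
    k0, th0 = prior
    return (lam * k_ml + (1 - lam) * k0,
            lam * th_ml + (1 - lam) * th0)
```

With lam = 0 the previous model is kept unchanged; with lam = 1 the update is the unregularized re-estimate that proved unstable.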

[ abstract ]


Pincas, J. and Jackson, P.J.B. (2004). Quantifying voicing-frication interaction effects in voiced and voiceless fricatives.
In Proceedings of One-day Meeting for Young Speech Researchers, OMYSR 2004, p. 27, London.

Abstract:

Although speech does not, in general, switch cleanly between periodic and aperiodic noise sources, regions of mixed source sound have received little attention: aerodynamic treatments of source production mechanisms show that interaction will result in decreased amplitude of both sources, and limited previous research has suggested some spectral modification of frication sources by voicing. In this paper, we seek to extend current knowledge of voicing-frication interaction by applying a wider range of measures suitable for quantifying interaction effects to a specially recorded corpus of /VFV/ sequences. We present data for one male and one female subject (from a total of 8). Regions of voicing-frication overlap at the onset of voiceless fricatives often show interaction effects. The extent of such overlapping source regions is investigated with durational data. We have created a measure designed to quantify the magnitude of modulation where overlap does occur, in both these areas and in fully voiced fricatives. We employ high-pass filtering and short-time smoothing to produce an envelope which characterises temporal fluctuation of the aperiodic component. Periodicity at or around the fundamental frequency is interpreted as modulation of frication by voicing, and magnitude of amplitude modulation is computed with spectral analysis of the envelope. Further statistical techniques have been employed to describe the profile of aperiodic sound generation over the course of the fricative. In addition to the above, gradients of f0 contours in VF transitions and total duration of frication are analysed. Results are compared across the voiced/voiceless distinction and place of articulation. Source overlap and interaction effects are often ignored in synthesis systems; thus findings from this paper could potentially be used to improve naturalness of synthetic speech. 
Planned perceptual experiments will extend this work by establishing how significant interaction effects are to listeners.
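
The envelope-based modulation measure described above might be sketched as follows. This is a simplified, hypothetical version: an FFT-mask high-pass stands in for a proper filter, and the cutoff, smoothing length and analysis band are illustrative choices, not the paper's:

```python
import numpy as np

def modulation_depth(x, fs, f0, hp_cutoff=2000):
    """Estimate the magnitude of voicing-rate amplitude modulation of
    frication noise: high-pass the signal, form a smoothed envelope,
    and read the envelope spectrum near the fundamental (sketch)."""
    x = np.asarray(x, dtype=float)
    # crude high-pass by zeroing low-frequency FFT bins
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    X[freqs < hp_cutoff] = 0
    noise = np.fft.irfft(X, len(x))
    # rectify and smooth with a window shorter than one pitch period
    env = np.abs(noise)
    w = max(1, int(fs / (4 * f0)))
    env = np.convolve(env, np.ones(w) / w, mode='same')
    # spectral analysis of the envelope around f0
    E = np.abs(np.fft.rfft(env - env.mean()))
    efreqs = np.fft.rfftfreq(len(env), 1 / fs)
    band = (efreqs > 0.8 * f0) & (efreqs < 1.2 * f0)
    return E[band].max() / (len(env) / 2)
```

A strong peak in the envelope spectrum at the fundamental is then interpreted as modulation of frication by voicing.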

[ abstract ]


Moreno, D.M. and Jackson, P.J.B. (2003). A front end using periodic and aperiodic streams for ASR.
In Proceedings of One-day Meeting for Young Speech Researchers, OMYSR 2003, p. 18, London.

Abstract:

Various acoustic mechanisms produce cues in human speech, such as voicing, frication and plosion. Automatic speech recognition (ASR) front ends often treat them alike, although studies demonstrate the dependence of their signal characteristics on the presence or absence of vocal-fold vibration. Typically, Mel-frequency cepstral coefficients (MFCCs) are used to extract features that are not strongly influenced by source characteristics. In contrast, harmonic and noise-like cues were here segregated before characterisation, by separating the contribution of voicing from those of other acoustic sources to improve feature extraction for both parts. The pitch-scaled harmonic filter (PSHF) divides an input speech signal into two synchronous streams, periodic and aperiodic, which are respective estimates of the voiced and unvoiced components of the signal at any time. In digit-recognition experiments with the Aurora 2.0 database (clean and noisy conditions, 4 kHz bandwidth), features were extracted from each of the decomposed streams, then combined (by concatenation or further manipulation) into an extended feature vector. Thus, the noise robustness of our parameterisation was compared against a conventional one (39 MFCCs, deltas, delta-deltas). Each separate stream reduced recognition accuracy by less than 1% absolute, compared to the baseline on the original speech; combined, they increased accuracy under noisy conditions (by 7.8% under 5 dB SNR, after multi-condition training). Voiced regions provided resilience to corruption by noise. However, no significant improvement on 99.0% baseline accuracy was achieved under clean test conditions. Principal component analysis (PCA) of concatenated features tended to perform better than PCA of the separate streams, and PCA of static coefficients better than PCA applied after calculation of deltas. With PCA of concatenated static MFCCs, plus deltas, the improvement was 5.6%, implying some redundancy between the complementary streams.
A planned evaluation of the PSHF front end for phoneme recognition at higher bandwidth could help to identify the source of these substantial performance benefits.
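
The feature-combination step can be sketched as concatenating per-frame features from the two streams and reducing the extended vector with PCA. This is a generic illustration of that step, not the experimental code; array names and dimensions are invented:

```python
import numpy as np

def concat_and_pca(feats_periodic, feats_aperiodic, n_components):
    """Concatenate frame-synchronous features from the periodic and
    aperiodic streams, then project onto the leading principal
    components (PCA via SVD of the centred data matrix)."""
    X = np.hstack([feats_periodic, feats_aperiodic])   # frames x (d1 + d2)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # s is descending
    return Xc @ Vt[:n_components].T                    # frames x n_components
```

Redundancy between the complementary streams would appear here as a sharp drop in the singular-value spectrum of the concatenated features.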

[ abstract | slides ]


Jackson, P.J.B., Lo, B.-H. and Russell, M.J. (2002). Models of speech dynamics for ASR, using intermediate linear representations.
Presented at NATO Advanced Study Institute on the Dynamics of Speech Production and Perception, Il Ciocco, Italy.

Abstract:

A theoretical and experimental analysis of a simple multi-level segmental HMM is presented in which the relationship between symbolic (phonetic) and surface (acoustic) representations of speech is regulated by an intermediate (articulatory) layer, where speech dynamics are modeled using linear trajectories. Three formant-based parameterizations and measured articulatory positions are considered as intermediate representations, from the TIMIT and MOCHA corpora respectively. The articulatory-to-acoustic mapping was performed by between 1 and 49 linear transformations. Results of phone-classification experiments demonstrate that, by appropriate choice of intermediate parameterization and mappings, it is possible to achieve close to optimal performance.
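
The intermediate layer can be sketched as a per-segment linear trajectory followed by a linear articulatory-to-acoustic map. This is a toy illustration of the model structure only; dimensions, names and the single mapping are invented (the paper used between 1 and 49 such transformations):

```python
import numpy as np

def segment_trajectory(midpoint, slope, n_frames):
    """Linear trajectory through the intermediate (articulatory) space
    for one segment, parameterised by its midpoint and slope."""
    t = np.arange(n_frames) - (n_frames - 1) / 2      # centred frame index
    return midpoint[None, :] + t[:, None] * slope[None, :]

def to_acoustic(traj, W, b):
    """One linear articulatory-to-acoustic transformation."""
    return traj @ W.T + b
```

Phone classification then scores acoustic observations against the mapped trajectories for each candidate model.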

[ abstract | ppt ]


Jackson, P.J.B. (2002). Mama and papa: the ancestors of modern-day speech science.
In Proceedings of the International Conference and Commemoration of the Bicentenary of the Death of Erasmus Darwin, ICCBDED, p. 14, Lichfield, UK.

Abstract:

Erasmus Darwin's writings on the subject of human speech included discussion of the alphabet as an unsatisfactory phonetic representation of the spoken word, of mechanisms of speech production and, indeed, of a mechanical speaking machine [1,2]. His studies of the acoustic properties of speech were limited, as it was not until many generations later that the physical behaviour of sound waves began to be understood in any detail [3]. Nevertheless, his analysis of sounds on the basis of their manner of production and place of articulation was highly insightful, and is comparable to the classification scheme laid down by the International Phonetic Association. Furthermore, the wooden and leather device he had built was capable of pronouncing the vowel /a/ and labial consonants which, in English, are /p/, /b/ and /m/. These could be combined to create some simple utterances, as in my title. This paper will examine many of the technical aspects of Darwin's investigations into the nature of speech, and relate them to the findings of contemporary research in the field. In particular, it will review the application of articulatory information in approaches to speech synthesis, and show how magnetic resonance images, together with a model of the vocal-tract acoustics, can be used for such purposes. Where appropriate, demonstrations will be given, to illustrate the different aspects of the technology, and connexions will be made between those aspects that Darwin brought to light and what speech science knows of them now.

References:

  1. Darwin, Erasmus (1803), "The Temple of Nature", J. Johnson, London, Add. Note XV:107-120.
  2. King-Hele, Desmond (1981), "The Letters of Erasmus Darwin", Cambridge University Press, Cambridge, UK.
  3. Lord Rayleigh (1877), "The Theory of Sound", 2nd edition, Dover, New York.

Session: Erasmus Darwin and technology

[ abstract | ppt ]


Jackson, P.J.B. and Shadle, C.H. (1999). Modelling vocal-tract acoustics validated by flow experiments.
Journal of the Acoustical Society of America, Vol. 105 (2, Pt. 2), p. 1161, Berlin, Germany (abstract).

Abstract:

Modelling the acoustic response of the vocal tract is a complex task, both from the point of view of acquiring details of its internal geometry and of accounting for the acoustic-flow interactions. A vocal-tract acoustics program (VOAC) has been developed [P. Davies, R. McGowan & C. Shadle, Vocal Fold Phys., ed. I. Titze, San Diego: Singular Pub., 93-142 (1993)], which uses a more realistic, aeroacoustic model of the vocal tract than classic electrical-analogue representations. It accommodates area and hydraulic radius profiles, smooth and abrupt area changes, incorporating end-corrections, side-branches, and net fluid flows, including turbulence losses incurred through jet formation. Originally, VOAC was tested by comparing vowel formant frequencies (i) uttered by subjects, (ii) predicted using classic electrical analogues, and (iii) predicted by VOAC. In this study, VOAC is further validated by comparing the predicted frequency response functions for a range of flow rates with measurements of the radiated sound from a series of mechanical models of unvoiced fricatives [C. Shadle, PhD thesis, MIT-RLE Tech. Rpt. 506 (1985)]. Results show VOAC is more accurate in predicting the complete spectrum at a range of flow rates. Finally, preliminary work is presented with VOAC used to simulate the sound generated at a sequence of stages during the release of a plosive.
 

[ abstract ]


Jackson, P.J.B. and Shadle, C.H. (1999). Analysis of mixed-source speech sounds: aspiration, voiced fricatives and breathiness.
In Proceedings of the 2nd International Conference on Voice Physiology and Biomechanics, p. 30, Berlin, Germany (abstract).

Abstract:

Our initial goal was to model the source characteristics of aspiration more accurately. The term is used inconsistently in the literature, but there is general agreement that aspiration is produced by turbulence noise generated in the vicinity of the glottis. Thus, in order to model aspiration, we must refine its concept, and in particular define its relation to other kinds of noise produced near the glottis, such as breathiness and hoarseness. For instance, do similar aeroacoustic processes operate transiently during a plosive release and steadily during a breathy vowel? In unvoiced fricatives, localized sources produce well-defined spectral troughs. We have therefore developed a series of analysis methods that generate spectra for transient and voice-and-noise-excited sounds. These methods include pitch-synchronous decomposition into harmonic and anharmonic components (based on a hoarseness metric of Muta et al., 1988), short-time spectra, ensemble averaging, and short-time harmonics-to-noise ratios (Jackson and Shadle, 1998). These have been applied to a corpus of repeated nonsense words consisting of aspirated stops in three vowel contexts and voiced and unvoiced fricatives, spoken in four voice qualities, thus providing multiple examples of mixed-source and transient-source speech sounds. Ensemble-averaged spectra derived throughout a stop release show evidence of a highly-localized noise source becoming more distributed. Variations by place are also apparent, complementing and extending previous work (Stevens and Blumstein, 1978; Stevens, 1993). The coordination of glottal and supraglottal articulation, described and modelled for aspiration by Scully and Mair (1995), is in a sense reversed for voiced fricatives. 
Use of the decomposition algorithm on voiced fricatives revealed greater complexity than expected: the anharmonic component appears sometimes to be modulated by the harmonic component, sometimes to be independent of it, and tends to change from one case to the other in the course of the fricative. In sum, we have made some progress in describing not only spectral but time-varying properties of an aspiration model, and in so doing, have improved our descriptions of other mixed-source, time-varying speech sounds.
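
The pitch-synchronous decomposition can be illustrated in a much-simplified form: averaging the signal over whole pitch periods estimates the harmonic component, and the residual gives the anharmonic (noise) part. This is a deliberately crude stand-in for the published methods, assuming a known, constant pitch period:

```python
import numpy as np

def decompose_pitch_sync(x, period):
    """Split a signal into harmonic and anharmonic components by
    averaging over whole pitch periods (simplified sketch)."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // period) * period          # trim to whole periods
    frames = x[:n].reshape(-1, period)
    harmonic_period = frames.mean(axis=0)    # the repeating (harmonic) part
    harmonic = np.tile(harmonic_period, frames.shape[0])
    anharmonic = x[:n] - harmonic            # residual noise component
    return harmonic, anharmonic
```

Modulation of the anharmonic component by voicing, as observed in the fricatives above, would survive this split and appear as periodicity in the residual's envelope.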
 

[ abstract ]



© 2002-7, maintained by Philip Jackson, last updated on 24 August 2007.
