Abstracts of my publications
Refereed Conference Abstracts
A Barney, PJB Jackson (2007).
"Aerodynamically-based parametric description of the noise envelope in voiced fricatives".
In J. Acoust. Soc. Am., 121 (5, Pt. 2): 3122 A, Salt Lake City, UT, USA.
VD Singampalli, PJB Jackson (2007).
"A statistical technique for identifying articulatory roles in speech production".
In Proc. One-day Meeting for Young Speech Researchers, p. 25 A, London, UK.
VD Singampalli, PJB Jackson (2007).
"Coarticulatory relations in a compact model of articulatory dynamics".
In Proc. one-day meeting on Unified Models for Speech Recognition and Synthesis,* Birmingham, UK, p. 3 (A).
Y Shiga, PJB Jackson (2007).
"Comparison of Pruning Strategies for Segmental HMMs".
In Proc. one-day meeting on Unified Models for Speech Recognition and Synthesis,* Birmingham, UK, p. 7 (A).
N Nadtoka, PJB Jackson, J Edge, A Hilton, J Tena (2006).
"Representing dynamics of facial expressions".
In IET Conference on Visual Media Production, London, UK, 1 p. (A).
A Barney, PJB Jackson (2006).
"Modulation of frication noise in a dynamic mechanical model of the larynx and vocal tract".
In J. Acoust. Soc. Am., 119 (5, Pt. 2): 3301 A, Providence, RI, USA.
J Pincas, PJB Jackson (2006).
"Detection thresholds for amplitude modulation of noise with simultaneous modulating tone".
In J. Acoust. Soc. Am., 119 (5, Pt. 2): 3234 A, Providence, RI, USA.
Pincas, J. and Jackson, P.J.B. (2005a).
Amplitude profiles of fricatives described by temporal moments.
In Proceedings of One-day Meeting for Young Speech Researchers,
OMYSR 2005,
p. 12, London.
Abstract:
As well as the rapid fluctuations in amplitude that make up the 'fine structure'
of noise, various degrees of slower loudness change, or envelope fluctuation,
are present in fricative sounds.
In voiced fricatives, noise is generally amplitude modulated by the voicing
component, resulting in a periodic pulsing [Pincas and Jackson 2004, Proc. of
From Sound to Sense, MIT, 73-78].
In addition, all fricatives display some build up and decay of noise power from
frication onset to offset.
This paper focuses on these latter amplitude changes, which we term amplitude
profiles.
Frication build-up and decay for an 8-speaker corpus of intervocalic fricatives
was investigated by treating their amplitude profiles as statistical
distributions whose properties are fully specified by their first four standard
moments: mean, standard deviation, skewness and kurtosis ('peakiness').
This is an adaptation of the spectral moments technique previously used to
describe the main features of fricative spectra [Jongman et al. 2000, JASA
108(3):1252-1263].
Analysis of these temporal moments shows that the sibilant/non-sibilant split is
consistently manifested in the 'flatness' of profiles, whereas voicing status
has more effect on whether build-up is skewed towards the beginning or the end
of the fricative.
These acoustic results are examined in light of probable articulatory
explanations.
The perceptual significance of amplitude profiles is also discussed.
It is known, for example, that the temporal acuity of the auditory system is
good enough to distinguish even very fast amplitude fluctuations [Viemeister
1990, JASA 88(3):1367-1373], but it is unclear to what extent differences in
profiles could function as a linguistic cue or naturalness enhancer.
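The temporal-moments idea in this abstract can be illustrated with a short Python sketch that treats an amplitude profile as a distribution over time and computes its first four moments. The function name, the frame-based profile representation and the example values are assumptions made for illustration, not the authors' code or data. Kurtosis is given in its non-excess form.

```python
import math

def temporal_moments(envelope, frame_period=1.0):
    """Mean, standard deviation, skewness and kurtosis of an amplitude profile.

    `envelope` is a sequence of non-negative amplitudes, one per frame
    (hypothetical representation); it is normalised to sum to one and
    treated as a probability distribution over frame times.
    """
    total = sum(envelope)
    if total <= 0:
        raise ValueError("envelope must have positive total amplitude")
    weights = [a / total for a in envelope]
    times = [i * frame_period for i in range(len(envelope))]
    mean = sum(w * t for w, t in zip(weights, times))
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, times))
    sd = math.sqrt(var)
    skew = sum(w * (t - mean) ** 3 for w, t in zip(weights, times)) / sd ** 3
    kurt = sum(w * (t - mean) ** 4 for w, t in zip(weights, times)) / sd ** 4
    return mean, sd, skew, kurt

# Illustrative (made-up) profiles: one symmetric, one peaking early.
symmetric = [1, 2, 3, 2, 1]
early_peak = [1, 4, 3, 2, 1, 1]
print(temporal_moments(symmetric))
print(temporal_moments(early_peak))
```

In this sketch a positive skewness means the profile peaks early and has a long tail towards frication offset, while a low kurtosis corresponds to a flatter profile.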
Jackson, P.J.B., Jesus, L.M.T., Shadle, C.H. and Pincas, J. (2004).
Measures of voiced frication for automatic classification.
Journal of the Acoustical Society of America,
Vol. 115 (5, Pt. 2), p. 2429, New York NY (abstract).
Abstract:
As an approach to understanding the characteristics of the acoustic
sources in voiced fricatives, it seems apt to draw on knowledge of vowels
and voiceless fricatives, which have been relatively well studied.
However,
the presence of both phonation and frication in these mixed-source
sounds offers the possibility of mutual interaction effects, with
variations across place of articulation.
This paper examines the acoustic and articulatory consequences
of these interactions and explores automatic techniques for finding
parametric and statistical descriptions of these phenomena.
A reliable and consistent set of such acoustic cues
could be used for phonetic classification or speech recognition.
Following work on devoicing of European Portuguese voiced fricatives
[Jesus & Shadle, In Mamede, et al. (Eds.),
pp. 1-8, Berlin: Springer-Verlag, 2003]
and the modulating effect of voicing on frication
[Jackson & Shadle, JASA, 108(4): 1421-1434, 2000],
the present study focuses on three types of information:
(i) sequences and durations of acoustic events in VC transitions,
(ii) temporal, spectral and modulation measures from the periodic and
aperiodic components of the acoustic signal, and
(iii) voicing activity derived from simultaneous EGG data.
Analysis of interactions observed in British/American English and
European Portuguese speech corpora will be compared, and the principal
findings discussed.
Russell, M.J. and Jackson, P.J.B. (2004).
Regularized re-estimation of stochastic duration models.
Journal of the Acoustical Society of America,
Vol. 115 (5, Pt. 2), p. 2429, New York NY (abstract).
Abstract:
Recent research has compared the performance of various distributions
(uniform, boxcar, exponential, gamma, discrete) for modeling segment
(state) durations in hidden semi-Markov models used for phone
classification on the TIMIT database.
These experiments have shown that a gamma distribution is more
appropriate than exponential (which is implicit in first-order Markov
models), and achieved a 3% relative reduction in
phone-classification errors
[Jackson, Proc. ICPhS, 1349-1352, 2003].
The parameters of these duration distributions were estimated once
for each model from initial statistics of state occupation (offline),
and remained unchanged during subsequent iterations of training.
The present work investigates the effect of re-estimating the
duration models in training (online) with respect to the
phone-classification scores.
First, tests were conducted on duration models re-estimated directly
from statistics gathered in the previous iteration of training.
It was found that the boxcar and gamma models were unstable,
while the performance of the other models also tended to degrade.
Secondary tests, using a scheme of annealed regularization,
demonstrated that the losses could be recouped and a further 1%
improvement was obtained.
The results from this pilot study imply that similar gains in
recognition accuracy deserve investigation, along with further
optimization of the duration model re-estimation procedure.
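As a rough illustration of the kind of procedure this abstract discusses, the Python sketch below fits a gamma duration model by the method of moments and then blends old and new parameter estimates with an interpolation weight. The abstract does not specify the regularization scheme, so the interpolation rule, the function names and the example durations are all assumptions, not the authors' method.

```python
def fit_gamma_moments(durations):
    """Method-of-moments gamma fit: returns (shape, scale).

    Uses mean = shape * scale and variance = shape * scale**2.
    """
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    if var <= 0:
        raise ValueError("durations need non-zero variance")
    return mean * mean / var, var / mean

def regularized_update(old, new, weight):
    """Hypothetical regularized step: weight=0 keeps `old`, weight=1 takes `new`."""
    return tuple((1 - weight) * o + weight * n for o, n in zip(old, new))

# Made-up state durations (in frames) from two training iterations.
initial = fit_gamma_moments([2, 4, 4, 6])
reestimated = fit_gamma_moments([1, 3, 5, 7])

# A weight that changes from one iteration to the next would give an
# annealed schedule; a single step is shown here.
print(regularized_update(initial, reestimated, weight=0.25))
```

Keeping the weight small early in training limits how far a noisy re-estimate can move the duration model, which is one plausible way to address the instability reported for direct re-estimation.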
Pincas, J. and Jackson, P.J.B. (2004).
Quantifying voicing-frication interaction effects in voiced and voiceless fricatives.
In Proceedings of One-day Meeting for Young Speech
Researchers, OMYSR 2004,
p. 27, London.
Abstract:
Although speech does not, in general, switch cleanly between periodic and
aperiodic noise sources, regions of mixed-source sound have received little
attention: aerodynamic treatments of source production mechanisms show
that interaction will result in decreased amplitude of both sources, and limited
previous research has suggested some spectral modification of frication
sources by voicing.
In this paper, we seek to extend current knowledge of voicing-frication
interaction by applying a wider range of measures suitable for quantifying
interaction effects to a specially recorded corpus of /VFV/ sequences.
We present data for one male and one female subject (from a total of 8).
Regions of voicing-frication overlap at the onset of voiceless fricatives often
show interaction effects.
The extent of such overlapping source regions is investigated with durational
data.
We have created a measure designed to quantify the magnitude of
modulation where overlap does occur, in both these areas and in fully voiced
fricatives.
We employ high-pass filtering and short-time smoothing to produce an
envelope which characterises temporal fluctuation of the aperiodic
component.
Periodicity at or around the fundamental frequency is interpreted as
modulation of frication by voicing, and magnitude of amplitude modulation is
computed with spectral analysis of the envelope.
Further statistical techniques have been employed to describe the profile of
aperiodic sound generation over the course of the fricative.
In addition to the above, gradients of f0 contours in VF transitions and total
duration of frication are analysed.
Results are compared across the voiced/voiceless distinction and place of
articulation.
Source overlap and interaction effects are often ignored in synthesis
systems; thus findings from this paper could potentially be used to improve
naturalness of synthetic speech.
Planned perceptual experiments will extend the work done by establishing
how significant interaction effects are to listeners.
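The modulation measure outlined in this abstract (high-pass filtering, short-time smoothing to obtain an envelope, then spectral analysis of that envelope near the fundamental frequency) might be sketched in Python as follows. The first-difference filter, the window length, the normalisation of the index and the synthetic test signal are all assumptions chosen for illustration. The abstract does not give these details.

```python
import math
import random

def high_pass(x):
    """First-difference filter, used here as a crude high-pass (assumption)."""
    return [x[i] - x[i - 1] for i in range(1, len(x))]

def smooth_envelope(x, win):
    """Moving average of the rectified signal over `win` samples."""
    rect = [abs(v) for v in x]
    acc = sum(rect[:win])
    out = [acc / win]
    for i in range(win, len(rect)):
        acc += rect[i] - rect[i - win]
        out.append(acc / win)
    return out

def modulation_index(env, f0, fs):
    """Amplitude of the envelope component at f0, relative to the envelope mean."""
    n = len(env)
    mean = sum(env) / n
    re = sum((e - mean) * math.cos(2 * math.pi * f0 * i / fs) for i, e in enumerate(env))
    im = sum((e - mean) * math.sin(2 * math.pi * f0 * i / fs) for i, e in enumerate(env))
    return 2 * math.hypot(re, im) / (n * mean)

def demo(depth, f0=125.0, fs=8000, n=8000, seed=0):
    """Synthetic noise, amplitude modulated at f0 with the given depth."""
    rng = random.Random(seed)
    sig = [(1 + depth * math.cos(2 * math.pi * f0 * i / fs)) * rng.gauss(0, 1)
           for i in range(n)]
    env = smooth_envelope(high_pass(sig), win=16)
    return modulation_index(env, f0, fs)

print(demo(0.8))  # strongly modulated noise
print(demo(0.0))  # unmodulated noise
```

The smoothing window attenuates the measured modulation somewhat, so the index underestimates the true depth; a shorter window reduces this bias at the cost of a noisier envelope.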
Moreno, D.M. and Jackson, P.J.B. (2003).
A front end using periodic and aperiodic streams for ASR.
In Proceedings of One-day Meeting for Young Speech
Researchers, OMYSR 2003,
p. 18, London.
Abstract:
Various acoustic mechanisms produce cues in human speech,
such as voicing, frication and plosion. Automatic speech recognition
(ASR) front ends often treat them alike, although studies demonstrate the
dependence of their signal characteristics on the presence or absence of
vocal-fold vibration. Typically, Mel-frequency cepstral coefficients
(MFCCs) are used to extract features that are not strongly influenced by
source characteristics. In contrast, in this work harmonic and noise-like cues were
segregated before characterisation, by separating the contribution of
voicing from those of other acoustic sources to improve feature extraction
for both parts. The pitch-scaled harmonic filter (PSHF) divides an input
speech signal into two synchronous streams: periodic and aperiodic,
respective estimates of voiced and unvoiced components of the signal at
any time. In digit-recognition experiments with the Aurora 2.0 database
(clean and noisy conditions, 4 kHz bandwidth), features were extracted from
each of the decomposed streams, then combined (by concatenation or further
manipulation) into an extended feature vector. Thus, the noise robustness
of our parameterisation was compared against a conventional one (39 MFCCs,
deltas, delta-deltas). Each separate stream reduced recognition accuracy
by less than 1% absolute, compared to the baseline on the original speech;
combined, they increased accuracy under noisy conditions (by 7.8% under
5 dB SNR, after multi-condition training). Voiced regions provided
resilience to corruption by noise. However, no significant improvement on
99.0% baseline accuracy was achieved under clean test conditions.
Principal component analysis (PCA) of concatenated features tended to
perform better than PCA of the separate streams, and PCA of static
coefficients better than PCA applied after calculation of deltas. With PCA of
concatenated static MFCCs, plus deltas, the improvement was 5.6%, implying
some redundancy between the complementary streams. Future plans to
evaluate the PSHF front end for phoneme recognition with higher bandwidth
could help to identify the source of these substantial performance
benefits.
Jackson, P.J.B., Lo, B.-H. and Russell, M.J. (2002).
Models of speech dynamics for ASR, using intermediate linear
representations.
Presented at NATO Advanced Study Institute on the Dynamics of Speech Production and
Perception,
Il Ciocco, Italy.
Abstract:
A theoretical and experimental analysis of a simple multi-level segmental HMM
is presented in which the relationship between symbolic (phonetic) and surface
(acoustic) representations of speech is regulated by an intermediate
(articulatory) layer, where speech dynamics are modeled using linear
trajectories.
Three formant-based parameterizations and measured articulatory
positions are considered as intermediate representations, from the TIMIT and
MOCHA corpora respectively.
The articulatory-to-acoustic mapping was performed by between 1 and 49 linear
transformations.
Results of phone-classification experiments demonstrate that, by appropriate
choice of intermediate parameterization and mappings, it is possible to
achieve close to optimal performance.
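To make the idea of a linear trajectory in an intermediate layer more concrete, the Python sketch below fits a midpoint and slope to one segment of a one-dimensional intermediate feature by least squares, and applies a linear transformation to map an intermediate vector to an acoustic one. The function names, the example frames and the transformation matrix are hypothetical. The abstract does not describe the model's estimation procedure, so this is only a sketch under those assumptions.

```python
def fit_linear_trajectory(frames):
    """Least-squares midpoint c and slope m, so that frame t ~ c + m * (t - t_mid)."""
    n = len(frames)
    t_mid = (n - 1) / 2
    c = sum(frames) / n
    denom = sum((t - t_mid) ** 2 for t in range(n))
    m = sum((t - t_mid) * y for t, y in enumerate(frames)) / denom if denom else 0.0
    return c, m

def map_to_acoustic(point, weights, bias):
    """Apply one linear transformation (matrix `weights`, offset `bias`)."""
    return [sum(w * x for w, x in zip(row, point)) + b
            for row, b in zip(weights, bias)]

# Made-up segment of an articulatory feature, rising steadily over four frames.
midpoint, slope = fit_linear_trajectory([1.0, 2.0, 3.0, 4.0])
print(midpoint, slope)

# Hypothetical 2-D intermediate vector mapped to a 2-D acoustic vector.
print(map_to_acoustic([1.0, 2.0], [[1.0, 0.0], [0.5, 0.5]], [0.0, 1.0]))
```

Using several transformations, as in the range of 1 to 49 mentioned above, would amount to choosing a different `weights` and `bias` pair for each class of segments.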
Jackson, P.J.B. (2002).
Mama and papa: the ancestors of modern-day speech science.
In Proceedings of the International Conference and Commemoration of the
Bicentenary of the Death of Erasmus Darwin, ICCBDED,
p. 14, Lichfield, UK.
Abstract:
Erasmus Darwin's writings on the subject of human speech included discussion
of the alphabet as an unsatisfactory phonetic representation of the spoken
word, of mechanisms of speech production and, indeed, of a mechanical
speaking machine [1,2].
His studies of the acoustic properties of speech were limited, as
it was not until many generations later that the physical behaviour of sound
waves began to be understood in any detail [3].
Nevertheless, his analysis of sounds on the basis of their manner of
production and place of articulation was highly insightful, and is comparable
to the classification scheme laid down by the International Phonetic
Association.
Furthermore, the wooden and leather device he had built was capable of
pronouncing the vowel /a/ and labial consonants which, in English, are /p/,
/b/ and /m/.
These could be combined to create some simple utterances, as in my title.
This paper will examine many of the technical aspects of Darwin's
investigations into the nature of speech, and relate them to the
findings of contemporary research in the field.
In particular, it will review the application of articulatory information
in approaches to speech synthesis, and show how magnetic resonance images,
together with a model of the vocal-tract acoustics, can
be used for such purposes.
Where appropriate, demonstrations will be given, to illustrate the different
aspects of the technology, and connexions will be made between those aspects
that Darwin brought to light and what speech science knows of them now.
References:
[1] Darwin, Erasmus (1803), "The Temple of Nature", J. Johnson, London,
Add. Note XV:107-120.
[2] King-Hele, Desmond (1981), "The Letters of Erasmus Darwin", Cambridge
University Press, Cambridge, UK.
[3] Lord Rayleigh (1877), "The Theory of Sound", 2nd edition, Dover, New
York.
Session:
Erasmus Darwin and technology
Jackson, P.J.B. and Shadle, C.H. (1999).
Modelling vocal-tract acoustics validated by flow experiments.
Journal of the Acoustical Society of America,
Vol. 105 (2, Pt. 2), p. 1161, Berlin, Germany (abstract).
Abstract:
Modelling the acoustic response of the vocal tract is a complex task, both from
the point of view of acquiring details of its internal geometry and of accounting
for the acoustic-flow interactions.
A vocal-tract acoustics program (VOAC) has been developed [P.
Davies, R. McGowan & C. Shadle, Vocal Fold Phys., ed. I. Titze, San
Diego: Singular Pub., 93-142 (1993)], which uses a more realistic,
aeroacoustic model of the vocal tract than classic electrical-analogue
representations.
It accommodates area and hydraulic radius profiles, smooth and abrupt area
changes, incorporating end-corrections, side-branches, and net fluid flows,
including turbulence losses incurred through jet formation.
Originally, VOAC was tested by comparing vowel formant frequencies (i) uttered
by subjects, (ii) predicted using classic electrical analogues, and (iii)
predicted by VOAC.
In this study, VOAC is further validated by comparing the predicted frequency
response functions for a range of flow rates with measurements of the radiated
sound from a series of mechanical models of unvoiced fricatives [C. Shadle,
PhD thesis, MIT-RLE Tech. Rpt. 506 (1985)].
Results show VOAC is more accurate in predicting the complete spectrum at a
range of flow rates.
Finally, preliminary work is presented with VOAC used to simulate the sound
generated at a sequence of stages during the release of a plosive.
Jackson, P.J.B. and Shadle, C.H. (1999).
Analysis of mixed-source speech sounds:
aspiration, voiced fricatives and breathiness.
In Proceedings of the 2nd International Conference on Voice Physiology
and Biomechanics, p. 30, Berlin, Germany (abstract).
Abstract:
Our initial goal was to model the source characteristics of aspiration more
accurately. The term is used inconsistently in the literature, but there is
general agreement that aspiration is produced by turbulence noise generated
in the vicinity of the glottis. Thus, in order to model aspiration, we must refine
its concept, and in particular define its relation to other kinds of noise
produced near the glottis, such as breathiness and hoarseness. For instance,
do similar aeroacoustic processes operate transiently during a plosive release
and steadily during a breathy vowel? In unvoiced fricatives, localized sources
produce well-defined spectral troughs. We have therefore developed a series
of analysis methods that generate spectra for transient and
voice-and-noise-excited sounds. These methods include pitch-synchronous
decomposition into harmonic and anharmonic components (based on a
hoarseness metric of Muta et al., 1988), short-time spectra, ensemble
averaging, and short-time harmonics-to-noise ratios (Jackson and Shadle,
1998). These have been applied to a corpus of repeated nonsense words
consisting of aspirated stops in three vowel contexts and voiced and unvoiced
fricatives, spoken in four voice qualities, thus providing multiple examples of
mixed-source and transient-source speech sounds. Ensemble-averaged
spectra derived throughout a stop release show evidence of a
highly-localized noise source becoming more distributed. Variations by place
are also apparent, complementing and extending previous work (Stevens and
Blumstein, 1978; Stevens, 1993). The coordination of glottal and supraglottal
articulation, described and modelled for aspiration by Scully and Mair (1995),
is in a sense reversed for voiced fricatives. Use of the decomposition algorithm
on voiced fricatives revealed greater complexity than expected: the
anharmonic component appears sometimes to be modulated by the harmonic
component, sometimes to be independent of it, and tends to change from one
case to the other in the course of the fricative. In sum, we have made some
progress in describing not only spectral but time-varying properties of an
aspiration model, and in so doing, have improved our descriptions of other
mixed-source, time-varying speech sounds.