Abstracts of my publications
Listing
Journal papers
IET-SP, 2007
JASA, 2006
CSL, 2005
El. Lett., 2002
IEEE-SAP, 2001
JASA, 2000
Conferences
Interspeech 2006
Interspeech 2005
AES 2005
OMYSR 2005
3DPVT 2004
FSTS 2004
ASA 2004
OMYSR 2004
Eurospeech 2003
ICPhS 2003
EC-VIP-MC 2003
OMYSR 2003
ICSLP 2002
ASI 2002
ICCBDED 2002
CRAC 2001
WISP 2001
ICASSP 2000
SPS5 2000
ICPhS 1999
ASA-EAA 1999
ICVPB 1999
ICA-ASA 1998
ASME 1996
Book chapter
G of ED, 2005
Doctoral thesis
PhD, 2000
Academic Journal Papers
Russell, M.J., Zheng, X. and Jackson, P.J.B. (2007).
Modelling speech signals using formant frequencies as an intermediate representation.
IET Signal Processing,
Vol. 1 (1), pp. 43-50.
Abstract:
Multiple-level segmental hidden Markov models (M-SHMMs) in which the relationship between symbolic and acoustic representations of speech is regulated by a formant-based intermediate representation are considered. New TIMIT phone recognition results are presented, confirming that the theoretical upper-bound on performance is achieved provided that either the intermediate representation or the formant-to-acoustic mapping is sufficiently rich. The way in which M-SHMMs exploit formant-based information is also investigated, using singular value decomposition of the formant-to-acoustic mappings and linear discriminant analysis. The analysis shows that if the intermediate layer contains information which is linearly related to the spectral representation, that information is used in preference to explicit formant frequencies, even though the latter are useful for phone discrimination. In summary, although these results confirm the utility of M-SHMMs for automatic speech recognition, they provide empirical evidence of the value of nonlinear formant-to-acoustic mappings.
INSPEC codes: A4370; B6130E; C5260S; A4360; A0210; A0250; B0210; B0240J; C1110; C1140J
Pincas, J. and Jackson, P.J.B. (2006).
Amplitude modulation of turbulence noise by voicing in fricatives.
Journal of the Acoustical Society of America,
Vol. 120 (6), pp. 3966-3977.
Abstract:
The two principal sources of sound in speech, voicing and frication, occur simultaneously in voiced fricatives as well as at the vowel-fricative boundary in phonologically voiceless fricatives.
Instead of simply overlapping, the two sources interact.
This paper is an acoustic study of one such interaction effect: the amplitude modulation of the frication component when voicing is present.
Corpora of sustained and fluent-speech English fricatives were recorded and analyzed using a signal-processing technique designed to extract estimates of modulation depth.
Results reveal a pattern, consistent across speaking style, speakers and places of articulation, for modulation at f0 to rise at low voicing strengths and subsequently saturate.
The voicing strength needed to produce saturation varied from 60 to 66 dB across subjects and experimental conditions.
Modulation depths at saturation varied little across speakers but significantly with place of articulation (with [z] showing particularly strong modulation), clustering at approximately 0.4-0.5 (a 40-50% fluctuation above and below the unmodulated amplitude); spectral analysis of modulating signals revealed weak but detectable modulation at the second and third harmonics (i.e., 2f0 and 3f0).
PACS numbers: 43.70.Bk, 43.72.Ar
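
As a rough illustration of the measurement principle (not the paper's exact procedure, which also compensated for varying pitch), modulation depth can be estimated by high-pass filtering to isolate the frication noise, taking its amplitude envelope, and comparing the envelope's spectral component at f0 with its mean level. The function name and cut-off below are assumptions:

    import numpy as np
    from scipy.signal import butter, sosfilt, hilbert

    def modulation_index(x, fs, f0, hp_cutoff=2000.0):
        # Isolate the frication noise, assumed to lie above hp_cutoff (Hz).
        sos = butter(4, hp_cutoff, btype='highpass', fs=fs, output='sos')
        noise = sosfilt(sos, x)
        # Amplitude envelope of the noise via the analytic signal.
        env = np.abs(hilbert(noise))
        # Modulation at f0 appears as an envelope-spectrum peak relative
        # to the DC term (the mean noise amplitude).
        spec = np.fft.rfft(env * np.hanning(len(env)))
        freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
        k = int(np.argmin(np.abs(freqs - f0)))
        return 2.0 * np.abs(spec[k]) / np.abs(spec[0])

An index of 0.5 on this scale corresponds to the 50% fluctuation about the unmodulated amplitude described above.
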
Russell, M.J. and Jackson, P.J.B. (2005).
A multiple-level linear/linear segmental HMM with a formant-based
intermediate layer.
Computer Speech and Language,
Vol. 19 (2), pp. 205-225.
Abstract:
A novel multi-level segmental HMM (MSHMM)
is presented in which the relationship between symbolic (phonetic) and
surface (acoustic) representations of speech is regulated by an
intermediate `articulatory' representation.
Speech dynamics are characterised as linear trajectories in the
articulatory space, which are transformed into the acoustic space using
an articulatory-to-acoustic mapping.
Recognition is then performed.
The results of phonetic classification experiments are presented for
monophone and triphone MSHMMs using three formant-based `articulatory'
parameterisations and sets of between 1 and 49 linear
articulatory-to-acoustic mappings.
The NIST Matched Pair Sentence Segment (Word Error) test shows that, for
a sufficiently rich combination of articulatory parameterisation and
mappings, differences between these results and those obtained with an
optimal classifier are not statistically significant. It is also shown
that, compared with a conventional HMM, superior performance can be
achieved using an MSHMM with 25% fewer parameters.
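
To make the model concrete: each segment posits a linear trajectory in the intermediate space, which a linear mapping projects into acoustic-space means. The sketch below is a minimal reading of that likelihood computation, with assumed array shapes and diagonal Gaussian noise, not the published implementation:

    import numpy as np

    def segment_loglik(Y, x0, slope, W, b, var):
        # Linear 'articulatory' trajectory x_t = x0 + slope * t over the
        # T frames of the segment (x0, slope: intermediate-space vectors).
        T = Y.shape[0]
        t = np.arange(T)[:, None]
        X = x0[None, :] + t * slope[None, :]
        # Linear articulatory-to-acoustic mapping gives per-frame means.
        mu = X @ W.T + b
        # Diagonal-covariance Gaussian log-likelihood of acoustic frames Y.
        resid = Y - mu
        return -0.5 * np.sum(resid**2 / var + np.log(2 * np.pi * var))

Classification then picks the phone model (and, with multiple mappings, the mapping) that maximises this score over the segment.
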
Jackson, P.J.B., Lo, B.-H. and Russell, M.J. (2002).
Data-driven, non-linear, formant-to-acoustic mapping for ASR.
IEE Electronics Letters, Vol. 38 (13),
pp. 667-669.
Abstract:
The underlying dynamics of speech can be captured in an
automatic speech recognition system via an articulatory
representation, which resides in a domain other than that of
the acoustic observations.
Thus, given a set of models in this hidden domain, it is
essential that a mapping can be obtained to relate the
intermediate representation to the acoustic domain.
In this paper, two methods for mapping from formants to
short-term spectra are compared: multi-layered perceptrons
(MLPs) and radial-basis function (RBF) networks.
Both are capable of providing non-linear transformations, and
were trained using features extracted from the TIMIT database.
Various schemes for dividing the frames of speech data
according to their phone class were also investigated.
Results showed that the RBF networks performed approximately
10 % better than the MLPs, in terms of the rms error, and
that a classification based on discrete regions of the
articulatory space gave the greatest improvements over a
single network.
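
For context, an RBF network of the kind compared here can be fitted in closed form once its centres and width are fixed, which is one reason it suits this mapping; a minimal sketch with Gaussian bases and least-squares output weights (all names are illustrative):

    import numpy as np

    def rbf_design(F, centres, width):
        # Gaussian basis activations of formant vectors F (rows) against
        # fixed centres (rows), with a shared isotropic width.
        d2 = ((F[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * width**2))

    def rbf_fit(F, S, centres, width):
        # Least-squares output weights mapping formants to spectra S.
        W, *_ = np.linalg.lstsq(rbf_design(F, centres, width), S, rcond=None)
        return W

    def rbf_rms_error(F, S, centres, width, W):
        # RMS spectral error, the figure of merit quoted above.
        err = rbf_design(F, centres, width) @ W - S
        return float(np.sqrt(np.mean(err**2)))
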
Jackson, P.J.B. and Shadle, C.H. (2001).
Pitch-scaled estimation of simultaneous voiced and turbulence-noise
components in speech.
IEEE Transactions on Speech and Audio Processing,
Vol. 9 (7),
pp. 713-726.
Abstract:
Almost all speech contains simultaneous contributions from more than
one acoustic source within the speaker's vocal tract.
In this paper we propose a method -
the pitch-scaled harmonic filter (PSHF) -
which aims to separate the voiced and turbulence-noise components of the
speech signal during phonation, based on a maximum likelihood approach.
The PSHF outputs periodic and aperiodic components that are estimates of the
respective contributions of the different types of acoustic source.
It produces four reconstructed time-series signals by decomposing the
original speech signal, first according to the amplitude, and then according
to the power, of the Fourier coefficients.
Thus, one pair of periodic and aperiodic signals is optimized for subsequent
time-series analysis, and another pair for spectral analysis.
The performance of the PSHF algorithm was tested on synthetic signals,
using three forms of disturbance (jitter, shimmer and additive
noise), and the results were used to predict the performance on real
speech.
Processing recorded speech examples elicited latent features from the
signals, demonstrating the PSHF's potential for analysis of mixed-source
speech.
EDICS number: 1-ANLS
Keywords:
Periodic-aperiodic decomposition,
speech modification,
speech pre-processing.
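
The heart of the PSHF is the pitch-scaled window: analysing exactly b pitch periods puts the harmonics of f0 on every b-th DFT bin, where the spectrum can be split. A minimal single-frame sketch of this amplitude-based split follows (the published algorithm adds the power-based correction and resynthesis described above; names are assumptions):

    import numpy as np

    def pshf_frame(x, f0, fs, b=4):
        # Window of exactly b pitch periods, so harmonics of f0 fall on
        # DFT bins b, 2b, 3b, ... of the pitch-scaled transform.
        n = int(round(b * fs / f0))
        frame = x[:n]
        X = np.fft.rfft(frame)
        harm = np.zeros_like(X)
        harm[b::b] = X[b::b]           # keep only the harmonic bins
        periodic = np.fft.irfft(harm, n)
        aperiodic = frame - periodic   # residual estimates the noise
        return periodic, aperiodic
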
Jackson, P.J.B. and Shadle, C.H. (2000).
Frication noise modulated by voicing, as
revealed by pitch-scaled decomposition.
Journal of the Acoustical Society of America, Vol. 108 (4),
pp. 1421-1434.
Abstract:
A decomposition algorithm that uses a pitch-scaled harmonic filter
was evaluated using synthetic signals and applied to mixed-source speech,
spoken by three subjects, to separate the voiced and unvoiced parts.
Pulsing of the noise component was observed in voiced frication,
which was analyzed by complex demodulation of the signal envelope.
The timing of the pulsation, represented by the phase of the anharmonic
modulation coefficient, showed a step change during a vowel-fricative
transition corresponding to the change in location of the sound source
within the vocal tract.
Analysis of a set of fricatives, including /v/ and /z/, demonstrated a
relationship between steady-state phase and place, and f0 glides confirmed
that the main cause was a place-dependent delay.
PACS numbers: 43.70.Bk, 43.72.Ar
Refereed Conference Proceedings
Every, M. and Jackson, P.J.B. (2006).
Enhancement of harmonic content of speech based on a dynamic
programming pitch tracking algorithm.
In Proceedings of Interspeech 2006,
4pp., Pittsburgh PA.
Abstract:
For pitch tracking of a single speaker, a common requirement
is to find the optimal path through a set of voiced or voiceless
pitch estimates over a sequence of time frames.
Dynamic programming (DP) algorithms have been applied before to this problem.
Here, the pitch candidates are provided by a multi-channel
autocorrelation-based estimator, and DP is extended to pitch tracking
of multiple concurrent speakers.
We use the resulting pitch information to enhance harmonic content in noisy speech and to obtain separations of target from interfering speech.
Index Terms: speech enhancement, dynamic programming
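
For the single-speaker case, the DP search can be written as a short Viterbi-style recursion over per-frame candidates; the sketch below assumes salience scores from some estimator and a log-frequency jump penalty (names and cost form are illustrative, and the multi-speaker extension is not shown):

    import numpy as np

    def dp_pitch_track(candidates, scores, jump_cost=1.0):
        # candidates[t]: array of pitch candidates (Hz) at frame t;
        # scores[t]: matching salience scores (higher is better).
        T = len(candidates)
        cost = [-np.asarray(scores[0], dtype=float)]
        back = []
        for t in range(1, T):
            prev = np.log(candidates[t - 1])[:, None]
            cur = np.log(candidates[t])[None, :]
            trans = jump_cost * np.abs(prev - cur)  # penalise pitch jumps
            total = cost[-1][:, None] + trans       # (n_prev, n_cur)
            back.append(np.argmin(total, axis=0))
            cost.append(total.min(axis=0) - np.asarray(scores[t], dtype=float))
        # Trace the cheapest path back through the saved pointers.
        path = [int(np.argmin(cost[-1]))]
        for bp in reversed(back):
            path.append(int(bp[path[-1]]))
        path.reverse()
        return [float(candidates[t][i]) for t, i in enumerate(path)]
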
Pincas, J. and Jackson, P.J.B. (2005b).
Amplitude modulation of frication noise by voicing saturates.
In Proceedings of Interspeech 2005,
4pp., Lisbon.
Abstract:
The two distinct sound sources comprising voiced frication, voicing and
frication, interact.
One effect is that the periodic source at the glottis modulates
the amplitude of the frication source originating in the vocal tract above the
constriction.
Voicing strength and modulation depth for frication noise were measured for sustained English voiced fricatives using high-pass filtering, spectral analysis in the
modulation (envelope) domain, and a variable pitch compensation procedure.
Results show a positive relationship between strength of the glottal source
and modulation depth at
voicing strengths below 66 dB SPL, at which point the modulation
index was approximately 0.5 and saturation occurred.
The alveolar [z] was found to be more modulated than other fricatives.
Dewhirst, M., Zielinski, S., Jackson, P.J.B. and Rumsey F. (2005).
Objective assessment of spatial localisation attributes of surround-sound reproduction systems.
In Proceedings of 118th Convention of the Audio Engineering Society,
AES 2005,
16pp., Barcelona, Spain.
Abstract:
A mathematical model for objective assessment of perceived spatial quality was
developed for comparison across the listening area of various sound reproduction
systems: mono, two-channel stereo (TCS), 3/2 stereo (i.e., 5.0 surround sound),
Wave Field Synthesis (WFS) and Higher Order Ambisonics (HOA).
Models for mono, TCS and 3/2 stereo are based on conventional microphone
techniques and loudspeaker configurations for each system.
WFS and HOA models use circular arrays of thirty-two loudspeakers driven by
signals derived from a virtual microphone array and the Fourier-Bessel spatial
decomposition of the soundfield respectively.
Directional localisation, ensemble width and ensemble envelopment of
monochromatic tones,
extracted from binaural signals, are analysed under a range of test conditions.
Pincas, J. and Jackson, P.J.B. (2005a).
Amplitude profiles of fricatives described by temporal moments.
In Proceedings of One-day Meeting for Young Speech Researchers,
OMYSR 2005,
p. 12, London.
Abstract:
As well as the rapid fluctuations in amplitude that make up the `fine structure'
of noise, various degrees of slower loudness change, or `envelope fluctuation',
are present in fricative sounds.
In voiced fricatives, noise is generally amplitude modulated by the voicing
component, resulting in a periodic pulsing [Pincas and Jackson 2004, Proc. of
From Sound to Sense, MIT, 73-78].
In addition, all fricatives display some build up and decay of noise power from
frication onset to offset.
This paper focuses on these latter amplitude changes, which we term amplitude
profiles.
Frication build-up and decay for an 8-speaker corpus of intervocalic fricatives
were investigated by treating their amplitude profiles as statistical
distributions whose properties are fully specified by their first four standard
moments: mean, standard deviation, skewness and kurtosis (`peakiness').
This is an adaptation of the spectral moments technique previously used to
describe the main features of fricative spectra [Jongman et al. 2000, JASA
108(3):1252-1263].
Analysis of these temporal moments shows that the sibilant/non-sibilant split is
consistently manifested in the `flatness' of profiles, whereas voicing status
has more effect on whether build-up is skewed towards the beginning or the end
of the fricative.
These acoustic results are examined in light of probable articulatory
explanations.
The perceptual significance of amplitude profiles is also discussed.
It is known, for example, that the temporal acuity of the auditory system is
good enough to distinguish even very fast amplitude fluctuations [Viemeister
1990, JASA 88(3):1367-1373], but it is unclear to what extent differences in
profiles could function as a linguistic cue or naturalness enhancer.
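
The moment computation itself is compact: normalise the (non-negative) amplitude profile so that it sums to one and treat it as a distribution over time. A sketch under that assumption (the envelope smoothing used in the paper is not reproduced; names are illustrative):

    import numpy as np

    def temporal_moments(profile, fs):
        # Treat the amplitude profile as a probability distribution over
        # time and take its first four standardised moments.
        t = np.arange(len(profile)) / fs
        p = profile / profile.sum()
        mean = np.sum(t * p)
        sd = np.sqrt(np.sum((t - mean) ** 2 * p))
        skew = np.sum(((t - mean) / sd) ** 3 * p)
        kurt = np.sum(((t - mean) / sd) ** 4 * p)  # 'peakiness'
        return mean, sd, skew, kurt
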
Ypsilos, I.A., Hilton, A., Turkmani, A. and Jackson, P.J.B. (2004).
Speech-driven face synthesis from 3D video.
In IEEE Proceedings of
the 2nd International Symposium on 3D Data Processing, Visualization and
Transmission (3DPVT'04),
pp. 58-65, Thessaloniki, Greece.
Abstract:
This paper presents a framework for speech-driven synthesis of real faces
from a corpus of 3D video of a person speaking.
Video-rate capture of dynamic 3D face shape and colour appearance
provides the basis for a visual speech synthesis model.
A displacement map representation combines face shape and colour into
a 3D video.
This representation is used to efficiently register and integrate shape and
colour information captured from multiple views.
To allow visual speech synthesis, viseme primitives are identified from the
corpus using automatic speech recognition.
A novel non-rigid alignment algorithm is introduced to estimate dense
correspondence between 3D face shape and appearance for different
visemes.
The registered displacement map representation, together with a novel
optical flow optimisation using both shape and colour, enables accurate
and efficient non-rigid alignment.
Face synthesis from speech is performed by concatenation of the
corresponding viseme sequence using the non-rigid correspondence
to reproduce both 3D face shape and colour appearance.
Concatenative synthesis reproduces both viseme timing and
co-articulation.
Face capture and synthesis have been performed for a database of 51
people.
Results demonstrate synthesis of 3D visual speech animation with a
quality comparable to the captured video of a person.
Pincas, J. and Jackson, P.J.B. (2004).
Acoustic correlates of voicing-frication interaction in fricatives.
In Proceedings of From Sound to Sense,
J Slifka, S Manuel and M Matthies (eds.),
pp. C73-C78, Cambridge MA.
Abstract:
This paper investigates the acoustic effects of source interaction in fricative
speech sounds.
A range of parameters has been employed, including a measure designed
specifically to describe quantitatively the amplitude modulation of frication
noise by voicing, a phenomenon which has mainly been qualitatively
reported.
The signal processing technique to extract this measure is presented.
Results suggest that fricative duration is the main determinant of how much
the sources overlap at the VF boundary of voiceless fricatives and that the
amount of modulation occurring in voiced fricatives is chiefly dependent on
voicing strength.
Furthermore, it appears that individual speakers have differing tendencies
for amount of source-source overlap and degree of modulation where
overlap does occur.
Jackson, P.J.B., Jesus, L.M.T., Shadle, C.H. and Pincas, J. (2004).
Measures of voiced frication for automatic classification.
Journal of the Acoustical Society of America,
Vol. 115 (5, Pt. 2), p. 2429, New York NY (abstract).
Abstract:
As an approach to understanding the characteristics of the acoustic
sources in voiced fricatives, it seems apt to draw on knowledge of vowels
and voiceless fricatives, which have been relatively well studied.
However,
the presence of both phonation and frication in these mixed-source
sounds offers the possibility of mutual interaction effects, with
variations across place of articulation.
This paper examines the acoustic and articulatory consequences
of these interactions and explores automatic techniques for finding
parametric and statistical descriptions of these phenomena.
A reliable and consistent set of such acoustic cues
could be used for phonetic classification or speech recognition.
Following work on devoicing of European Portuguese voiced fricatives
[Jesus & Shadle, In Mamede, et al. (Eds.),
pp. 1-8, Berlin: Springer-Verlag, 2003]
and the modulating effect of voicing on frication
[Jackson & Shadle, JASA, 108(4): 1421-1434, 2000],
the present study focuses on three types of information:
(i) sequences and durations of acoustic events in VC transitions,
(ii) temporal, spectral and modulation measures from the periodic and
aperiodic components of the acoustic signal, and
(iii) voicing activity derived from simultaneous EGG data.
Analyses of the interactions observed in British/American English and
European Portuguese speech corpora will be compared, and the principal
findings discussed.
Russell, M.J. and Jackson, P.J.B. (2004).
Regularized re-estimation of stochastic duration models.
Journal of the Acoustical Society of America,
Vol. 115 (5, Pt. 2), p. 2429, New York NY (abstract).
Abstract:
Recent research has compared the performance of various distributions
(uniform, boxcar, exponential, gamma, discrete) for modeling segment
(state) durations in hidden semi-Markov models used for phone
classification on the TIMIT database.
These experiments have shown that a gamma distribution is more
appropriate than exponential (which is implicit in first-order Markov
models), and achieved a 3% relative reduction in
phone-classification errors
[Jackson, Proc. ICPhS, 1349-1352, 2003].
The parameters of these duration distributions were estimated once
for each model from initial statistics of state occupation (offline),
and remained unchanged during subsequent iterations of training.
The present work investigates the effect of re-estimating the
duration models in training (online) with respect to the
phone-classification scores.
First, tests were conducted on duration models re-estimated directly
from statistics gathered in the previous iteration of training.
It was found that the boxcar and gamma models were unstable,
while the performance of the other models also tended to degrade.
Secondary tests, using a scheme of annealed regularization,
demonstrated that the losses could be recouped and a further 1%
improvement was obtained.
The results from this pilot study imply that similar gains in
recognition accuracy deserve investigation, along with further
optimization of the duration model re-estimation procedure.
Pincas, J. and Jackson, P.J.B. (2004).
Quantifying voicing-frication interaction effects in voiced and voiceless fricatives.
In Proceedings of One-day Meeting for Young Speech
Researchers, OMYSR 2004,
p. 27, London.
Abstract:
Although speech does not, in general, switch cleanly between periodic and
aperiodic noise sources, regions of mixed source sound have received little
attention: aerodynamic treatments of source production mechanisms show
that interaction will result in decreased amplitude of both sources, and limited
previous research has suggested some spectral modification of frication
sources by voicing.
In this paper, we seek to extend current knowledge of voicing-frication
interaction by applying a wider range of measures suitable for quantifying
interaction effects to a specially recorded corpus of /VFV/ sequences.
We present data for one male and one female subject (from a total of 8).
Regions of voicing-frication overlap at the onset of voiceless fricatives often
show interaction effects.
The extent of such overlapping source regions is investigated with durational
data.
We have created a measure designed to quantify the magnitude of
modulation where overlap does occur, in both these areas and in fully voiced
fricatives.
We employ high-pass filtering and short-time smoothing to produce an
envelope which characterises temporal fluctuation of the aperiodic
component.
Periodicity at or around the fundamental frequency is interpreted as
modulation of frication by voicing, and magnitude of amplitude modulation is
computed with spectral analysis of the envelope.
Further statistical techniques have been employed to describe the profile of
aperiodic sound generation over the course of the fricative.
In addition to the above, gradients of f0 contours in VF transitions and total
duration of frication are analysed.
Results are compared across the voiced/voiceless distinction and place of
articulation.
Source overlap and interaction effects are often ignored in synthesis
systems; thus findings from this paper could potentially be used to improve
naturalness of synthetic speech.
Planned perceptual experiments will extend the work done by establishing
how significant interaction effects are to listeners.
Jackson, P.J.B., Moreno, D.M., Russell, M.J. and Hernando, J. (2003).
Covariation and weighting of harmonically decomposed streams for ASR.
In Proceedings of Eurospeech 2003,
pp. 2321-2324, Geneva.
Abstract:
Decomposition of speech signals into simultaneous streams of periodic and
aperiodic information has been successfully applied to speech analysis,
enhancement, modification and recently recognition.
This paper examines the effect of different weightings of the two streams
in a conventional HMM system in digit recognition tests on the Aurora 2.0
database.
Comparison of the results from using matched weights during training showed a
small improvement of approximately 10% relative to unmatched ones,
under clean test conditions.
Principal component analysis of the covariation amongst the periodic and
aperiodic features indicated that only 45 (51) of the 78 coefficients were
required to account for 99% of the variance, for clean (multi-condition)
training, which yielded an 18.4% (10.3%) absolute increase in accuracy with
respect to the baseline.
These findings provide further evidence of the potential for
harmonically-decomposed streams to improve performance and
substantially to enhance recognition accuracy in noise.
Session:
OWeDc, Speech Modeling & Features 2 (oral).
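
The dimensionality figures quoted (45 or 51 of 78 coefficients for 99% of the variance) follow from a standard eigen-analysis of the feature covariance; a generic sketch of that reduction (names assumed):

    import numpy as np

    def pca_reduce(X, variance_kept=0.99):
        # X: frames x features (e.g. 78 = concatenated periodic and
        # aperiodic streams); returns projected data and its dimension.
        Xc = X - X.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        order = np.argsort(evals)[::-1]          # descending variance
        evals, evecs = evals[order], evecs[:, order]
        frac = np.cumsum(evals) / evals.sum()
        k = int(np.searchsorted(frac, variance_kept)) + 1
        return Xc @ evecs[:, :k], k
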
Russell, M.J. and Jackson, P.J.B. (2003).
The effect of an intermediate articulatory layer on the performance of
a segmental HMM.
In Proceedings of Eurospeech 2003,
pp. 2737-2740, Geneva.
Abstract:
We present a novel multi-level HMM in which an intermediate `articulatory'
representation is included between the state and surface-acoustic levels.
A potential difficulty with such a model is that advantages gained by the
introduction of an articulatory layer might be compromised by limitations
due to an insufficiently rich articulatory representation, or by
compromises made for mathematical or computational expediency. This paper
describes a simple model in which speech dynamics are modelled as linear
trajectories in a formant-based `articulatory' layer, and the
articulatory-to-acoustic mappings are linear. Phone classification
results for TIMIT are presented for monophone and triphone systems with a
phone-level syntax. The results demonstrate that provided the
intermediate representation is sufficiently rich, or a sufficiently large
number of phone-class-dependent articulatory-to-acoustic mappings are
employed, classification performance is not compromised.
Session:
PThBf, Robust Speech Recognition 3 (poster).
Jackson, P.J.B. (2003).
Improvements in phone-classification accuracy from modelling duration.
In Proceedings of the 15th International Congress of Phonetic
Sciences, ICPhS 2003,
pp. 1349-1352, Barcelona.
Abstract:
Durations of real speech segments do not generally exhibit exponential
distributions, as modelled implicitly by the state transitions of Markov
processes. Several duration models were considered for integration within a
segmental-HMM recognizer: uniform, exponential, Poisson, normal, gamma and
discrete. The gamma distribution fitted that measured for silence best, by an
order of magnitude. Evaluations determined an appropriate weighting for duration
against the acoustic models. Tests showed a reduction of 2% absolute (6+%
relative) in the phone-classification error rate with gamma and discrete models;
exponential ones gave approximately 1% absolute reduction, and uniform no
significant improvement. These gains in performance recommend the wider
application of explicit duration models.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Balthasar/]
Session:
T.3.P2, Automatic speech recognition / Auditory mechanisms (poster).
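
By way of illustration, a gamma duration model can be fitted to per-state duration statistics by moment matching and then contribute a weighted log-probability to the classification score; this sketch (parameterisation, weighting and names assumed) is not the recognizer's actual code:

    import numpy as np
    from scipy.stats import gamma

    def fit_gamma_duration(durations):
        # Method-of-moments fit to observed segment durations (frames).
        d = np.asarray(durations, dtype=float)
        shape = d.mean() ** 2 / d.var()
        scale = d.var() / d.mean()
        return shape, scale

    def duration_score(dur, shape, scale, weight=1.0):
        # Weighted duration log-probability, added to the acoustic
        # log-likelihood when scoring a segmentation hypothesis.
        return weight * gamma.logpdf(dur, a=shape, scale=scale)
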
Moreno, D.M., Jackson, P.J.B., Hernando, J. and Russell, M.J. (2003).
Improved ASR in noise using harmonic decomposition.
In Proceedings of the 15th International Congress of Phonetic
Sciences, ICPhS 2003,
pp. 751-754, Barcelona.
Abstract:
Application of the pitch-scaled harmonic filter (PSHF) to automatic speech
recognition in noise was investigated using the Aurora 2.0 database.
The PSHF decomposed the original speech into periodic and aperiodic streams.
Digit-recognition tests with the extended features compared the noise robustness
of various parameterisations against standard 39 MFCCs. Separately, each stream
reduced word accuracy by less than 1% absolute; together, the combined streams
gave substantial increases under noisy conditions. Applying PCA to concatenated
features proved better than to separate streams, and to static coefficients better
than after calculation of deltas. With multi-condition training, accuracy
improved by 7.8% at 5dB SNR, thus providing resilience from corruption by noise.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/]
Session:
M.4.5, Automatic speech recognition I (oral).
Russell, M.J., Jackson, P.J.B. and Wong, M.L.P. (2003).
Development of articulatory-based multi-level segmental HMMs for phonetic
classification in ASR.
In Proceedings of EURASIP Conference on Video/Image Processing and
Multimedia Communications,
EC-VIP-MC 2003, Vol. 2, pp. 655-660, Zagreb, Croatia.
Abstract:
A simple multiple-level HMM is presented in which speech dynamics are modelled
as linear trajectories in an intermediate, formant-based representation and
the mapping between the intermediate and acoustic data is achieved using one
or more linear transformations. An upper-bound on the performance of such a
system is established. Experimental results on the TIMIT corpus demonstrate
that, if the dimension of the intermediate space is sufficiently high or the
number of articulatory-to-acoustic mappings is sufficiently large, then this
upper-bound can be achieved.
Keywords:
Automatic speech recognition, Hidden Markov Models, segment models.
Moreno, D.M. and Jackson, P.J.B. (2003).
A front end using periodic and aperiodic streams for ASR.
In Proceedings of One-day Meeting for Young Speech
Researchers, OMYSR 2003,
p. 18, London.
Abstract:
Various acoustic mechanisms produce cues in human speech,
such as voicing, frication and plosion. Automatic speech recognition
(ASR) front ends often treat them alike, although studies demonstrate the
dependence of their signal characteristics on the presence or absence of
vocal-fold vibration. Typically, Mel-frequency cepstral coefficients
(MFCCs) are used to extract features that are not strongly influenced by
source characteristics. In contrast, harmonic and noise-like cues were
segregated before characterisation, by separating the contribution of
voicing from those of other acoustic sources to improve feature extraction
for both parts. The pitch-scaled harmonic filter (PSHF) divides an input
speech signal into two synchronous streams: periodic and aperiodic,
respective estimates of voiced and unvoiced components of the signal at
any time. In digit-recognition experiments with the Aurora 2.0 database
(clean and noisy conditions, 4kHz bandwidth), features were extracted from
each of the decomposed streams, then combined (by concatenation or further
manipulation) into an extended feature vector. Thus, the noise robustness
of our parameterisation was compared against a conventional one (39 MFCCs,
deltas, delta-deltas). Each separate stream reduced recognition accuracy
by less than 1% absolute, compared to the baseline on the original speech;
combined, they increased accuracy under noisy conditions (by 7.8% under
5dB SNR, after multi-condition training). Voiced regions provided
resilience to corruption by noise. However, no significant improvement on
99.0% baseline accuracy was achieved under clean test conditions.
Principal component analysis (PCA) of concatenated features tended to
perform better than of the separate streams, and PCA of static
coefficients better than after calculation of deltas. With PCA of
concatenated static MFCCs, plus deltas, the improvement was 5.6%, implying
some redundancy between the complementary streams. Future plans to
evaluate the PSHF front end for phoneme recognition with higher bandwidth
could help to identify the source of these substantial performance
benefits.
Jackson, P.J.B. and Russell, M.J. (2002).
Models of speech dynamics in a segmental-HMM recognizer
using intermediate linear representations.
In Proceedings of the International Conference on Spoken Language
Processing, ICSLP 2002,
pp. 1253-1256, Denver CO.
Abstract:
A theoretical and experimental analysis of a simple multi-level segmental HMM
is presented in which the relationship between symbolic (phonetic) and surface
(acoustic) representations of speech is regulated by an intermediate
(articulatory) layer, where speech dynamics are modeled using linear
trajectories.
Three formant-based parameterizations and measured articulatory
positions are considered as intermediate representations, from the TIMIT and
MOCHA corpora respectively.
The articulatory-to-acoustic mapping was performed by between 1 and 49 linear
transformations.
Results of phone-classification experiments demonstrate that, by appropriate
choice of intermediate parameterization and mappings, it is possible to
achieve close to optimal performance.
Session:
Acoustic modelling
Jackson, P.J.B., Lo, B.-H. and Russell, M.J. (2002).
Models of speech dynamics for ASR, using intermediate linear
representations.
Presented at NATO Advanced Study Institute on the Dynamics of Speech Production and
Perception,
Il Ciocco, Italy.
Abstract:
A theoretical and experimental analysis of a simple multi-level segmental HMM
is presented in which the relationship between symbolic (phonetic) and surface
(acoustic) representations of speech is regulated by an intermediate
(articulatory) layer, where speech dynamics are modeled using linear
trajectories.
Three formant-based parameterizations and measured articulatory
positions are considered as intermediate representations, from the TIMIT and
MOCHA corpora respectively.
The articulatory-to-acoustic mapping was performed by between 1 and 49 linear
transformations.
Results of phone-classification experiments demonstrate that, by appropriate
choice of intermediate parameterization and mappings, it is possible to
achieve close to optimal performance.
Jackson, P.J.B. (2002).
Mama and papa: the ancestors of modern-day speech science.
In Proceedings of the International Conference and Commemoration of the
Bicentenary of the Death of Erasmus Darwin, ICCBDED,
p. 14, Lichfield, UK.
Abstract:
Erasmus Darwin's writings on the subject of human speech included discussion
of the alphabet as an unsatisfactory phonetic representation of the spoken
word, of mechanisms of speech production and, indeed, of a mechanical
speaking machine [1,2].
His studies of the acoustic properties of speech were limited, as
it was not until many generations later that the physical behaviour of sound
waves began to be understood in any detail [3].
Nevertheless, his analysis of sounds on the basis of their manner of
production and place of articulation was highly insightful, and is comparable
to the classification scheme laid down by the International Phonetic
Association.
Furthermore, the wooden and leather device he had built was capable of
pronouncing the vowel /a/ and labial consonants which, in English, are /p/,
/b/ and /m/.
These could be combined to create some simple utterances, as in my title.
This paper will examine many of the technical aspects of Darwin's
investigations into the nature of speech, and relate them to the
findings of contemporary research in the field.
In particular, it will review the application of articulatory information
in approaches to speech synthesis, and show how magnetic resonance images,
together with a model of the vocal-tract acoustics, can
be used for such purposes.
Where appropriate, demonstrations will be given, to illustrate the different
aspects of the technology, and connexions will be made between those aspects
that Darwin brought to light and what speech science knows of them now.
References:
1. Darwin, Erasmus (1803), "The Temple of Nature", J. Johnson, London,
Add. Note XV:107-120.
2. King-Hele, Desmond (1981), "The Letters of Erasmus Darwin", Cambridge
University Press, Cambridge, UK.
3. Lord Rayleigh (1877), "The Theory of Sound", 2nd edition, Dover, New
York.
Session:
Erasmus Darwin and technology
Jackson, P.J.B. (2001).
Acoustic cues of voiced and voiceless plosives
for determining place of articulation.
In Proceedings of Workshop on
Consistent and Reliable Acoustic Cues
for sound analysis, CRAC 2001, pp. 19-22, Aalborg, Denmark.
Abstract:
Speech signals from stop consonants with trailing vowels were
analysed for cues consistent with their place of articulation.
They were decomposed into periodic and aperiodic components
by the pitch-scaled harmonic filter to improve the quality of
the formant tracks, to which exponential trajectories were fitted
to get robust formant loci at voice onset.
Ensemble-average power spectra of the bursts exhibited dependence on place
(and on vowel context for velar consonants), but not on voicing.
By extrapolating the trajectories back to the release time, formant
estimates were compared with spectral peaks, and connexions
were made between these disparate acoustic cues.
Keywords:
acoustic cues, plosive, stop consonants.
Jackson, P.J.B. and Shadle, C.H. (2001).
Uses of the pitch-scaled harmonic
filter in speech processing.
In Proceedings of the Institute of Acoustics, Workshop on Innovation in
Speech Processing 2001, Vol. 23 (3),
pp. 309-321, Stratford-upon-Avon, UK.
Abstract:
The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech
signals into their periodic and aperiodic constituents, during periods of
phonation.
In this paper, the use of the PSHF for speech analysis and processing tasks is
described.
The periodic component can be used as an estimate of the part attributable
to voicing,
and the aperiodic component can act as an estimate of that
attributable to turbulence noise, i.e., from fricative, aspiration and plosive
sources.
Here we present the algorithm for separating the periodic and aperiodic
components from the pitch-scaled Fourier transform of a short section of
speech, and show how to derive signals suitable for time-series analysis and
for spectral analysis.
These components can then be processed in a manner appropriate to their
source type, for instance, extracting zeros as well as poles from the
aperiodic spectral envelope.
A summary of tests on synthetic speech-like signals demonstrates the
robustness of the PSHF's performance to perturbations from additive noise,
jitter and shimmer.
Examples are given of speech analysed in various ways:
power spectrum, short-time power and short-time harmonics-to-noise ratio,
linear prediction and mel-frequency cepstral coefficients.
Besides being valuable for speech production and perception studies, the
latter two analyses show potential for incorporation into speech coding and
speech recognition systems.
Further uses of the PSHF are revealing normally-obscured acoustic
features, exploring interactions of turbulence-noise sources with
voicing, and pre-processing speech to enhance subsequent operations.
Keywords:
periodic/aperiodic decomposition, acoustic features.
Jackson, P.J.B. and Shadle, C.H. (2000).
Performance of the pitch-scaled harmonic
filter and applications in speech analysis.
In Proceedings of IEEE International Conference on Acoustics, Speech
and Signal Processing, Vol. 3, pp. 1311-1314,
Istanbul.
Abstract:
The pitch-scaled harmonic filter (PSHF) is a technique for decomposing speech
signals into their voiced and unvoiced constituents.
In this paper, we evaluate its ability to reconstruct the
time series of the two components accurately using a variety of synthetic,
speech-like signals, and discuss its performance.
These results determine the degree of confidence that can be expected
for real speech signals: typically, 5 dB improvement in the
signal-to-noise ratio of the harmonic component and approximately
5 dB more than the initial harmonics-to-noise ratio (HNR) in the anharmonic
component.
A selection of the analysis opportunities that the decomposition offers
is demonstrated on speech recordings, including dynamic HNR estimation
and separate linear prediction analyses of the two components.
These new capabilities provided by the PSHF can facilitate
discovering previously hidden features and investigating interactions of
unvoiced sources, such as frication, with voicing.
Session:
3.2 Speech analysis
Keywords:
harmonics-to-noise ratio, voiced/unvoiced
decomposition, frication, aspiration noise.
Jackson, P.J.B. and Shadle, C.H. (2000).
Aero-acoustic modelling of voiced and unvoiced
fricatives based on MRI data.
In Proceedings of the 5th Seminar on Speech Production,
pp. 185-188, Seeon, Germany.
Abstract:
We would like to develop a more realistic production model of
unvoiced speech sounds, namely fricatives, plosives and aspiration noise.
All three involve turbulence noise generation, with place-dependent
source characteristics that vary with time (rapidly, in plosives).
In this study, we aimed to produce, using an aero-acoustic model of the
vocal-tract filter and source, voiced as well as unvoiced fricatives
that provide a good match to analyses of speech recordings.
The vocal-tract transfer function (VTTF) was computed by the vocal-tract
acoustics program, VOAC [Davies, McGowan and Shadle. Vocal Fold
Physiology: Frontiers in Basic Science, ed. Titze, Singular Pub., CA, 93-142,
1993], using geometrical data, in the form of
cross-sectional area and hydraulic radius functions, along the length of the
tract.
VOAC incorporates the effects of net flow into the transmission of plane
waves through a tubular representation of the tract, and relaxes assumptions
of rigid walls and isentropic propagation.
The geometry functions were derived from multiple-slice, dynamic, magnetic
resonance images (MRI) [Mohammad. PhD thesis, Dept. ECS, U. Southampton, UK,
1999; Shadle, Mohammad, Carter, and Jackson. Proc. ICPhS, S.F. CA, 1:623-626,
1999], using a method of converting from the pixel
outlines that was improved over earlier efforts on vowels.
A coloured noise source signal was combined with the VTTF and radiation
characteristic to synthesize the unvoiced fricative [s].
For its voiced counterpart [z], many researchers have noted that the noise
source appears to be modulated by voicing.
Furthermore, the phase of the modulation has been shown to be perceptually
significant.
Based on our analysis [Jackson and Shadle. Proc. IEEE-ICASSP, Istanbul, 2000.]
of recordings by the same subject, the frication source of [z] was varied
periodically according to fluctuations in the flow velocity at the constriction
exit, and the modulation phase was governed by the convection time for the flow
perturbation to travel from the constriction to the obstacle.
The synthesized fricatives were compared to the speech recordings in a simple
listening test, and comparisons of the predicted and measured time series
suggested that the model, which brings together physical, aerodynamic and
acoustic information, can replicate characteristics of real speech, such as
the modulation in voiced fricatives.
[http://www.ee.surrey.ac.uk/Personal/P.Jackson/Nephthys/]
Shadle, C.H., Mohammad, M., Carter, J.N. and Jackson, P.J.B. (1999).
Dynamic Magnetic Resonance Imaging: new tools
for speech research.
In Proceedings of the 14th International Congress of Phonetic
Sciences, Vol. 1, pp. 623-626, San Francisco, CA.
Abstract:
A multiplanar Dynamic Magnetic Resonance Imaging (MRI) technique that extends
our earlier work on single-plane Dynamic MRI is described.
Scanned images acquired while an utterance is repeated are recombined to form
pseudo-time-varying images of the vocal tract using a simultaneously recorded
audio signal.
There is no technical limit on the utterance length or number of slices that
can be so imaged, though the number of repetitions required may be limited by
the subject's stamina.
An example of [pasi] imaged in three sagittal planes is shown; with a Signa GE
0.5T MR scanner, 360 tokens were reconstructed to form a sequence of 39
three-slice frames of 16 ms each.
From these, a 3-D volume was generated for each time frame, and tract surfaces
outlined manually.
Parameters derived from these include: palate-tongue distances for [a,s,i];
estimates of tongue volume and of the area function using only the
midsagittal, and then all three slices.
These demonstrate the accuracy and usefulness of the technique.
Jackson, P.J.B. and Shadle, C.H. (1999).
Modelling vocal-tract acoustics validated by flow experiments.
Journal of the Acoustical Society of America,
Vol. 105 (2, Pt. 2), p. 1161, Berlin, Germany (abstract).
Abstract:
Modelling the acoustic response of the vocal tract is a complex task, both from
the point of view of acquiring details of its internal geometry and of accounting
for the acoustic-flow interactions.
A vocal-tract acoustics program (VOAC) has been developed [P.
Davies, R. McGowan & C. Shadle, Vocal Fold Phys., ed. I. Titze, San
Diego: Singular Pub., 93-142 (1993)], which uses a more realistic,
aeroacoustic model of the vocal tract than classic electrical-analogue
representations.
It accommodates area and hydraulic radius profiles, smooth and abrupt area
changes, incorporating end-corrections, side-branches, and net fluid flows,
including turbulence losses incurred through jet formation.
Originally, VOAC was tested by comparing vowel formant frequencies (i) uttered
by subjects, (ii) predicted using classic electrical analogues, and (iii)
predicted by VOAC.
In this study, VOAC is further validated by comparing the predicted frequency
response functions for a range of flow rates with measurements of the radiated
sound from a series of mechanical models of unvoiced fricatives [C. Shadle,
PhD thesis, MIT-RLE Tech. Rpt. 506 (1985)].
Results show that VOAC is more accurate in predicting the complete spectrum
over a range of flow rates.
Finally, preliminary work is presented with VOAC used to simulate the sound
generated at a sequence of stages during the release of a plosive.
Jackson, P.J.B. and Shadle, C.H. (1999).
Analysis of mixed-source speech sounds:
aspiration, voiced fricatives and breathiness.
In Proceedings of the 2nd International Conference on Voice Physiology
and Biomechanics, p. 30, Berlin, Germany (abstract).
Abstract:
Our initial goal was to model the source characteristics of aspiration more
accurately. The term is used inconsistently in the literature, but there is
general agreement that aspiration is produced by turbulence noise generated
in the vicinity of the glottis. Thus, in order to model aspiration, we must refine
its concept, and in particular define its relation to other kinds of noise
produced near the glottis, such as breathiness and hoarseness. For instance,
do similar aeroacoustic processes operate transiently during a plosive release
and steadily during a breathy vowel? In unvoiced fricatives, localized sources
produce well-defined spectral troughs. We have therefore developed a series
of analysis methods that generate spectra for transient and
voice-and-noise-excited sounds. These methods include pitch-synchronous
decomposition into harmonic and anharmonic components (based on a
hoarseness metric of Muta et al., 1988), short-time spectra, ensemble
averaging, and short-time harmonics-to-noise ratios (Jackson and Shadle,
1998). These have been applied to a corpus of repeated nonsense words
consisting of aspirated stops in three vowel contexts and voiced and unvoiced
fricatives, spoken in four voice qualities, thus providing multiple examples of
mixed-source and transient-source speech sounds. Ensemble-averaged
spectra derived throughout a stop release show evidence of a
highly-localized noise source becoming more distributed. Variations by place
are also apparent, complementing and extending previous work (Stevens and
Blumstein, 1978; Stevens, 1993). The coordination of glottal and supraglottal
articulation, described and modelled for aspiration by Scully and Mair (1995),
is in a sense reversed for voiced fricatives. Use of the decomposition algorithm
on voiced fricatives revealed greater complexity than expected: the
anharmonic component appears sometimes to be modulated by the harmonic
component, sometimes to be independent of it, and tends to change from one
case to the other in the course of the fricative. In sum, we have made some
progress in describing not only spectral but time-varying properties of an
aspiration model, and in so doing, have improved our descriptions of other
mixed-source, time-varying speech sounds.
Jackson, P.J.B. and Shadle, C.H. (1998).
Pitch-synchronous decomposition of mixed-source speech signals.
In Proceedings of the International Congress on Acoustics and Meeting
of the Acoustical Society of America, Vol. 1, pp. 263-264,
Seattle, WA.
Abstract:
As part of a study of turbulence-noise sources in speech production, a
method has been developed for decomposing an acoustic signal into harmonic
(voiced) and anharmonic (unvoiced) components, based on a hoarseness
metric (Muta et al., 1988, J. Acoust. Soc. Am. 84, pp.1292-1301). Their
pitch-synchronous harmonic filter (PSHF) has been extended (to EPSHF) to
yield time histories of both harmonic and anharmonic components. Our corpus
includes many examples of turbulence noise, including aspiration, voiced and
unvoiced fricatives, and a variety of voice qualities (e.g. breathy, whispered).
The EPSHF algorithm plausibly decomposed breathy vowels, but the harmonic
component of voiced fricatives still contained significant noise, similar in shape
to (though weaker than) the ensemble-averaged anharmonic spectrum. In
general the algorithm performed best on sustained sounds. Tracking errors at
rapid transitions, and due to jitter and shimmer, were spuriously attributed to
the anharmonic component. However, the extracted anharmonic component
clearly exhibited modulation in voiced fricatives. While such modulation has
been previously reported (and also in hoarse voice), it was verified by tests on
synthetic signals, where constant and modulated noise signals were extracted
successfully. The results suggest that the EPSHF will continue to enable
exploration of the interaction of phonation and turbulence noise.
Jackson, P.J.B. and Ross, C.F. (1996).
Application of active noise control to
corporate aircraft.
In Proceedings of the American Society of
Mechanical Engineers, Vol. DE93, pp. 19-25, Atlanta, GA.
Abstract:
Following the successful introduction of Active Noise Control (ANC) systems
as standard production fits on commuter aircraft (Saab2000, Saab340B and
Dash8Q series 100, 200 & 300), recent efforts have focused on developing
low-cost, low-weight systems for smaller corporate aircraft.
This paper describes the approach taken by Ultra to the new technical
challenges and the resulting improvements to the design methodology.
A review of system performance on corporate (King Air & Twin Commander)
turboprop aircraft shows repeatable global Tonal Noise Reductions (TNRs) of
>8 dBA throughout the whole cabin, achieving reductions >20 dB in some
locations at the blade-pass frequency (BPF), and major comfort benefits
throughout the flight envelope with a weight penalty of less than 20 kg.
Book chapter
Jackson, P.J.B. (2005).
Mama and papa: the ancestors of modern-day speech science.
Chapter 15 in The Genius of Erasmus Darwin,
CUM Smith and RG Arnott (eds.),
Aldershot, UK: Ashgate, pp. 217-236, ISBN 0-754-63671-2.
Abstract:
While many talk of the rapid pace of technological advancement
in the present age, the lack of progress in the realm of ideas over
the past two hundred years is perhaps more remarkable; this is most evident
when looking at what had already been accomplished so many moons ago, back in
the days of the Lunar Society.
As an engineering researcher of spoken language systems, my interest in
Erasmus Darwin's (ED's) work on speech was first ignited when I moved to
Lichfield and, like many others, I was struck by his achievements.
ED's writings on the subject of human speech included discussion of the
alphabet as an unsatisfactory phonetic representation of the spoken word,
of mechanisms of speech production and, indeed, of a mechanical speaking
machine (Darwin 1803; King-Hele 1981).
His studies of the acoustic properties of speech were limited, as no form
of sound reproduction had yet been invented and it was not until many
generations later that the physical behaviour of sound waves began to be
understood in any detail (Rayleigh 1877).
Nevertheless, his analysis of sounds on the basis of their manner
of production and place of articulation was highly insightful, and
is comparable to the classification scheme laid down by the
International Phonetic Association.
Furthermore, the wooden and leather device he built was capable of
pronouncing the vowel /ɑ/ and the English labial consonants /p/,
/b/ and /m/, which could be combined to create some simple utterances, as
in my title.
It is no surprise, therefore, that Darwin's contemporaries were impressed
(and sometimes alarmed!) by his inventions too.
Subject category:
Erasmus Darwin and technology.
PhD thesis
Jackson, P.J.B. (2000).
Characterisation of plosive, fricative and aspiration
components in speech production.
PhD thesis,
Department of Electronics and Computer Science, University of
Southampton, Southampton, UK.
Abstract:
This thesis is a study of the production of human speech sounds by
acoustic modelling and signal analysis.
It concentrates on sounds that are not produced by voicing (although that may
be present), namely plosives, fricatives and aspiration, which all contain
noise generated by flow turbulence.
It combines the application of advanced speech analysis techniques with
acoustic flow-duct modelling of the vocal tract, and draws on dynamic magnetic
resonance image (dMRI) data of the pharyngeal and oral cavities, to relate the
sounds to physical shapes.
Having superimposed vocal-tract outlines on three sagittal dMRI slices of an
adult male subject, a simple description of the vocal tract suitable for
acoustic modelling was derived through a sequence of transformations.
The vocal-tract acoustics program VOAC, which relaxes many of the assumptions
of conventional plane-wave models, incorporates the effects of net flow
into a one-dimensional model (viz., flow separation, increase of entropy, and
changes to resonances), as well as wall vibration and cylindrical wavefronts.
It was used for synthesis by computing transfer functions from sound sources
specified within the tract to the far field.
Being generated by a variety of aero-acoustic mechanisms, unvoiced sounds are
somewhat varied in nature.
Through analysis that was informed by acoustic modelling, resonance and
anti-resonance frequencies of ensemble-averaged plosive spectra were examined
for the same subject, and their trajectories observed during release.
The anti-resonance frequencies were used to compute the place of occlusion.
In vowels and voiced fricatives, voicing obscures the aspiration and frication
components.
So, a method was devised to separate the voiced and unvoiced parts of a
speech signal, the pitch-scaled harmonic filter (PSHF), which was tested
extensively on synthetic signals.
Based on a harmonic model of voicing, it outputs harmonic and
anharmonic signals appropriate for subsequent analysis as time series or
as power spectra.
By applying the PSHF to sustained voiced fricatives, we found not only that
voicing modulates the production of frication noise, but also that the timing
of pulsation cannot be explained by acoustic propagation alone.
In addition to classical investigation of voiceless speech sounds,
VOAC and the PSHF demonstrated their practical value in helping further to
characterise plosion, frication and aspiration noise.
For the future, we discuss developing VOAC within an articulatory
synthesiser, investigating the observed flow-acoustic mechanism in a
dynamic physical model of voiced frication, and applying the PSHF more
widely in the field of speech research.