@inproceedings{HaqEtAl_UKSpeech08,
        AUTHOR  =       "Haq, S. and Jackson, P. J. B. and Edge, J.",
        TITLE   =       "Audiovisual Emotion Recognition in an {English} Database",
        BOOKTITLE =     "Proc.\ One-day Mtg. for Young Spch.\ Res. (UK Speech'08)",
        ADDRESS =       "Guildford, UK",
        PAGES   =       "6",
        MONTH   =       "July",
        YEAR    =       "2008",
        ABSTRACT =
"Human communication relies on verbal and nonverbal information; for example,
facial expressions and intonation cue the speaker's emotional state.
Important speech features for emotion recognition are prosody (pitch, 
energy and duration) and voice quality (spectral energy, formants, MFCCs, 
jitter/shimmer). For facial expressions, features related to the forehead, eye
region, cheeks and lips are important. Both audio and visual modalities
provide relevant cues. Thus, audio and visual features were extracted and 
combined to evaluate emotion recognition on a British English corpus. The 
database of 120 utterances was recorded from an actor with 60 markers 
painted on his face, reading sentences in seven emotions (N=7): anger, 
disgust, fear, happiness, neutral, sadness and surprise. Recordings 
consisted of 15 phonetically-balanced TIMIT sentences per emotion, and 
video of the face captured by a 3dMD system. A total of 106 utterance-level 
audio features (prosodic and spectral) and 240 visual features (2D marker 
coordinates) were extracted. Experiments were performed with audio, visual 
and audiovisual features. The top 40 features were selected by sequential
forward-backward search using the Bhattacharyya distance criterion. PCA and LDA
transformations, calculated on the training data, were applied. Gaussian 
classifiers were trained with PCA and LDA features. The data were jack-knifed
into 6 sets, with 5 used for training and 1 for testing in each round. Results
were averaged over the 6 tests.

The emotion recognition accuracy was higher for visual features than audio 
features, for both PCA and LDA. Audiovisual results were close to those 
with visual features. Higher performance was achieved with LDA compared to 
PCA. The best recognition rate, 98\%, was achieved for 6 LDA features (N-1)
with audiovisual and visual features, whereas audio LDA scored 53\%. Maximum
PCA results for audio, visual and audiovisual features were 41\%, 97\% and
88\% respectively. Future work involves experiments with more subjects and
investigating the correlation between vocal and facial expressions of 
emotion."
}