Non-verbal Communication

We have been studying automatic recognition of non-verbal communication (NVC) based primarily on face shape. NVC is used in everyday conversations to complement the spoken words we use. An automatic NVC system may be useful for user product evaluations, computer game characters and learning tools. We use unconstrained, spontaneous conversations to make the data more applicable outside the lab. Videos were annotated using public Internet worker pools (including Amazon Mechanical Turk and Samasource) to collect NVC perception data from multiple cultures. This data is publicly available as the TwoTalk corpus. It enables us to investigate cultural differences in NVC perception and to create culturally specialized NVC recognition systems.

What is NVC?

Examples of NVC

When people communicate, both the words and the way the words are expressed are used to send and receive information. Non-verbal communication is all communication apart from that conveyed by the spoken words themselves. This includes facial expression, gesture, body positioning and sounds that are not involved in communicating words. Non-verbal information is necessary for understanding many types of social situations.

Why is it useful in computer vision?

We use communication skills intuitively in everyday life. However, a human communicating with a computer currently requires a very different set of skills and usually a greater degree of effort. Typically, the computer interface is tailored around the computer's systems rather than being designed to suit the human user. In contrast, "human centric" computing needs to be usable with the skills humans already possess. Given that non-verbal communication is required to understand other humans, it would be useful for computers to understand human non-verbal communication as well. We already see the first forays into this type of interface in software and games that use gesture interfaces in place of button-based designs. Computer understanding of NVC enables novel interfaces, such as direct emotional interfaces (possibly with computer-generated characters) and monitoring of human reactions to events while interacting with the computer. This may have applications in computer gaming, human training software, product research, social science and art.

Data Collection

Many previous studies have investigated deliberately acted emotions, rated by annotators from a single culture. However, natural communication occurs spontaneously and is observed in a specific cultural context. Spontaneous NVC has different timings and intensities of motion when compared to posed emotion. These differences can make systems that are trained on unrepresentative data work poorly in real applications. We attempt to address this by recording conversations with minimal experimental constraints. Each recording was of two people talking while seated across a table. The video clips were then annotated using Internet crowd sourcing to produce ratings from observers in specific cultures. These videos and annotations have been publicly released as the "TwoTalk corpus". A significant share of the annotation was performed by people based in India, the UK and Kenya.

The annotation categories are: "I understand what you", "I agree with you", "I am thinking" and "I am asking a question". We hope that these signals will be more useful, as they are expressed more commonly than most of Ekman's six classic, culturally independent emotions (anger, disgust, fear, happiness, sadness and surprise).
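
The per-culture ratings described above could be aggregated along the following lines. This is a minimal sketch, not the corpus's actual file format or processing: the record layout, the clip identifiers, the numeric rating scale and every value shown are assumptions made up for illustration. It averages the ratings for each clip, category and annotator culture, so that each culture keeps its own consensus score.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record layout: (clip_id, category, annotator_culture, rating).
# All values are invented for illustration; they are not TwoTalk data.
ratings = [
    ("clip01", "I agree with you", "UK", 4),
    ("clip01", "I agree with you", "UK", 5),
    ("clip01", "I agree with you", "India", 2),
    ("clip01", "I agree with you", "Kenya", 3),
    ("clip01", "I agree with you", "Kenya", 4),
    ("clip02", "I am thinking", "UK", 1),
    ("clip02", "I am thinking", "India", 3),
]


def consensus_by_culture(records):
    """Average the ratings within each (clip, category, culture) group."""
    groups = defaultdict(list)
    for clip_id, category, culture, rating in records:
        groups[(clip_id, category, culture)].append(rating)
    return {key: mean(values) for key, values in groups.items()}


consensus = consensus_by_culture(ratings)
for key in sorted(consensus):
    print(key, consensus[key])
```

Keeping the culture in the grouping key, rather than pooling all annotators, is what allows the same clip to carry a different label for each culture.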

Tracking of Facial Features

As humans change their expression and move parts of their body, the relative positions of facial and body features change over time. This information is useful for computer vision as it provides a way to model the body and face for further processing and understanding. Automatically estimating the position of a given visual feature across the frames of a video is called "tracking". However, natural conversation contains a very wide range of expressions, body poses and fast motions, as well as occasions when the view of certain features is blocked by obstructions (occlusions). This makes tracking natural conversations very challenging.

To improve robustness to head pose changes, we have developed a way to track the face and incrementally learn new appearances of the face as it turns away from the camera. This adaptation to the new face appearance is an example of on-line learning. However, this method struggles on videos with extreme head poses and occlusions, so we use LP flock-based tracking for videos of spontaneous NVC.
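
The idea of on-line learning during tracking can be illustrated with a much simpler tracker than the ones described above. The sketch below is a generic template tracker, not our method and not LP flock tracking: it searches a small window for the best normalised cross-correlation match and then blends the matched patch into the template. The function names, the search radius, the update rate and the acceptance threshold are all assumptions chosen for illustration, and the frames are synthetic.

```python
import numpy as np


def ncc(a, b):
    """Normalised cross-correlation between two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def track_step(frame, template, prev_xy, search=4, rate=0.1, min_score=0.5):
    """Find the best match near prev_xy, then adapt the template to it."""
    h, w = template.shape
    px, py = prev_xy
    best_score, best_xy = -1.0, prev_xy
    y_hi = min(frame.shape[0] - h, py + search)
    x_hi = min(frame.shape[1] - w, px + search)
    for y in range(max(0, py - search), y_hi + 1):
        for x in range(max(0, px - search), x_hi + 1):
            score = ncc(frame[y:y + h, x:x + w], template)
            if score > best_score:
                best_score, best_xy = score, (x, y)
    if best_score >= min_score:
        # On-line update: move the template slightly toward the new appearance.
        x, y = best_xy
        template = (1 - rate) * template + rate * frame[y:y + h, x:x + w]
    return best_xy, template, best_score


# Synthetic example: a random texture that shifts 3 pixels right and 2 down.
rng = np.random.default_rng(0)
frame0 = rng.random((40, 40))
template = frame0[10:18, 12:20].copy()  # 8x8 patch at x=12, y=10
frame1 = np.roll(frame0, shift=(2, 3), axis=(0, 1))
xy, template, score = track_step(frame1, template, (12, 10))
print(xy, round(score, 3))
```

Skipping the update when the match score is low is one simple guard against learning an occluder as part of the face. It does not help when the target leaves the search window, which is one reason a simple scheme like this struggles with fast motion and extreme pose.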

Automatic Classification and Regression of NVC

We have investigated several approaches for recognition of non-verbal communication. Our current work uses our multi-cultural annotation data from the "TwoTalk corpus". To our knowledge, cultural differences in the perception of non-verbal communication signals have not been considered in any previous computer vision paper. A key finding was that training an automatic system on data that is not representative of the test data results in reduced performance. We address this by training and testing our system on annotations from one specific culture at a time. This enables the automatic system to better model the cultural differences in NVC perception.
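
The effect of matching the training culture to the test culture can be shown with a toy regression. This is a sketch under strong assumptions, not our recognition system: cultures "A" and "B" are hypothetical, the features and ratings are synthetic, the weight vectors are invented so that the two cultures map the same features to different ratings, and ordinary least squares stands in for whatever regressor is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_culture(weights, n=200, noise=0.1):
    """Synthetic stand-in: feature vectors and one culture's perceived ratings."""
    X = rng.normal(size=(n, len(weights)))
    y = X @ weights + noise * rng.normal(size=n)
    return X, y


def fit(X, y):
    """Ordinary least squares with a bias column."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def rmse(coef, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


# Invented weights: the two hypothetical cultures read the same cues differently.
w_a = np.array([1.0, 0.5, 0.0])
w_b = np.array([0.2, 0.5, 1.0])

Xa_train, ya_train = make_culture(w_a)
Xa_test, ya_test = make_culture(w_a)
Xb_train, yb_train = make_culture(w_b)

# Test on culture A after training on culture A (matched) or B (mismatched).
matched = rmse(fit(Xa_train, ya_train), Xa_test, ya_test)
mismatched = rmse(fit(Xb_train, yb_train), Xa_test, ya_test)
print(f"matched RMSE: {matched:.3f}, mismatched RMSE: {mismatched:.3f}")
```

In this synthetic setting the mismatched model has the larger error by construction. The sketch only illustrates the evaluation protocol and says nothing about the size of the effect on real data.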

Last update: Jan 2012.

See Also

A. Vinciarelli, M. Pantic and H. Bourlard, 'Social Signal Processing: Survey of an Emerging Domain', in Image and Vision Computing Journal, in press, 2009