Features and Quantisation

In order to find correlations in sign it is first necessary to describe the sign. One of the main differences between sign recognition and gesture recognition is the added complexity that comes from sign being a full language. It depends not only on the motion of the hands but also on the hand shapes, the posture of the signer, and the facial expressions and lip shapes they use. This is before we even consider the various grammatical constructs which make sign a collection of rich and beautiful languages. In this work we have decided to concentrate on the learning mechanisms, so our features remain basic, much more so than in our previous work, and do not represent the true complexity of the problem. However, since all that is required is features which can be quantised, this work is easily extended to include many of the other phoneme components of sign.

Features

Using the dedicated head and hand tracker of Buehler et al. [6], we take just the head and hand positions on each frame.

Quantisation

Since the number of positions that the head and hands could occupy is large, we cluster the positions into a code book. For the head there are 10 clusters and for each of the hands there are 20 clusters. The clustering is done in the x,y domain using k-means with a simple Euclidean distance metric.

Figure 2 - Left: the 10 clusters for the head position; Centre: the 20 clusters for the dominant right hand; Right: the 20 clusters for the non-dominant left hand.
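
As a rough illustration of the quantisation step, the sketch below builds a code book with a plain Lloyd's-style k-means over (x, y) positions and assigns a position to its nearest cluster. It is not the implementation used in this work: the function names kmeans and quantise, the iteration cap, the random initialisation and the synthetic head positions are all assumptions made for the example.

```python
import math
import random


def quantise(point, centres):
    """Return the index of the code-book entry nearest to point (Euclidean)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means on a list of (x, y) tuples; returns k cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initialise from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[quantise(p, centres)].append(p)
        # Update step: move each centre to the mean of its group.
        new_centres = []
        for centre, group in zip(centres, groups):
            if group:
                xs, ys = zip(*group)
                new_centres.append((sum(xs) / len(group), sum(ys) / len(group)))
            else:
                new_centres.append(centre)  # leave an empty cluster where it is
        if new_centres == centres:  # converged
            break
        centres = new_centres
    return centres


# Synthetic stand-in for tracked head positions (assumed data, not from the tracker).
data_rng = random.Random(1)
head_positions = [(data_rng.gauss(160, 10), data_rng.gauss(60, 5)) for _ in range(500)]

head_codebook = kmeans(head_positions, k=10)  # 10 clusters for the head
cluster_index = quantise((158.0, 62.0), head_codebook)
print(cluster_index)
```

The same two functions would be applied separately to each hand with k=20, giving one code book per tracked body part.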

Temporalising into Symbols

Sign is a temporal medium, so it is logical to include some temporal information in the symbols describing each frame. We also need to add identifiers so that the learning mechanism can distinguish between the head and the two hands. To this end we concatenate cluster indexes across 2 frames and add an identifier, as shown below in Table 1. We now have 3 symbols to describe each frame and, from the subtitles, an indication of where our target sign is likely to appear. Next we need to find the correlation between the sections of video. Read more about this in the Mining section.

                          Head     Dominant    Non-Dominant
Symbol ID                 1        2           3
Cluster # on frame N      05       06          15
Cluster # on frame N-1    05       07          12
Resulting Symbol          10505    20607       31512

Table 1 - Showing how the cluster numbers are combined across 2 frames to give symbols showing where the head and hands are now and where they were previously.
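
The construction in Table 1 can be sketched in a few lines. This is an illustrative reading of the table, not the original code: the names make_symbol and frame_symbols are hypothetical, and the symbols are assumed to be strings with two-digit, zero-padded cluster indexes.

```python
def make_symbol(symbol_id, cluster_now, cluster_prev):
    """Concatenate the identifier with the cluster indexes on frames N and N-1."""
    if not (0 <= cluster_now < 100 and 0 <= cluster_prev < 100):
        raise ValueError("cluster indexes must fit in two digits")
    return f"{symbol_id}{cluster_now:02d}{cluster_prev:02d}"


def frame_symbols(current, previous):
    """Build the 3 symbols for a frame.

    current and previous are (head, dominant, non_dominant) cluster indexes
    on frames N and N-1; identifiers 1, 2, 3 follow the column order of Table 1.
    """
    pairs = zip(current, previous)
    return [make_symbol(i, now, prev) for i, (now, prev) in enumerate(pairs, start=1)]


# The values from Table 1.
symbols = frame_symbols(current=(5, 6, 15), previous=(5, 7, 12))
print(symbols)  # ['10505', '20607', '31512']
```

Zero-padding keeps every symbol the same length, so the identifier and the two cluster indexes can always be read back unambiguously.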