I am currently working on the EPSRC project Learning to Recognise Dynamic Visual Content from Broadcast Footage. The project covers the topics of sign language recognition, action recognition, and semantic video understanding. In all cases, the idea is to exploit the large quantities of freely available broadcast footage, such as signed TV programmes broadcast by the BBC, and films or series released on optical media. Although this content is vast in scope, the annotations that come with it (such as scripts and subtitles) are weak and unreliable. This necessitates learning approaches that can cope with such weak supervision.
As part of the project, I have examined the exploitation of 3D information within natural action recognition, as a means to reduce the amount of variation within classes. To this end, a dataset of natural actions with 3D data, called Hollywood 3D, was compiled from broadcast footage on 3D Blu-ray.
Histograms of Scene-flow (HOS)
I also developed a new action recognition descriptor based on my previous work in scene flow estimation. This encodes 3D motion information in a view-invariant manner (paper). The spotlight video (a 1 minute summary) is below.
Below are two example visualisations of the HOS descriptor. On the left is the HOS sampled at the detected interest points in an Eat sequence from Hollywood 3D; on the right is a densely sampled HOS on the "Hands" sequence. In each case, the 4 quadrants correspond to different out-of-plane (elevation) orientations. Each green box contains 9 spatial subregions, each with a 4-bin orientation histogram (azimuth).
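To make the layout of the descriptor concrete, the sketch below builds a HOS-style histogram from a dense field of 3D motion (scene flow) vectors: 4 elevation bands, a 3x3 spatial grid, and a 4-bin azimuth histogram per cell, matching the visualisation described above. This is an illustrative reconstruction under stated assumptions (NumPy, magnitude-weighted voting, global L2 normalisation), not the exact implementation from the paper; the function name `hos_descriptor` and its parameters are my own.

```python
import numpy as np

def hos_descriptor(flow, n_elev=4, grid=3, n_azim=4):
    """Sketch of a Histograms of Scene-flow (HOS) style descriptor.

    flow: (H, W, 3) array of per-pixel 3D motion vectors (vx, vy, vz).
    Returns a flat histogram with n_elev * grid * grid * n_azim bins
    (elevation bands x spatial subregions x azimuth bins).
    This is an illustrative sketch, not the paper's implementation.
    """
    H, W, _ = flow.shape
    vx, vy, vz = flow[..., 0], flow[..., 1], flow[..., 2]
    mag = np.sqrt(vx**2 + vy**2 + vz**2)

    # Azimuth in [0, 2*pi): in-plane direction of motion.
    azim = np.arctan2(vy, vx) % (2 * np.pi)
    # Elevation in [-pi/2, pi/2]: out-of-plane component of motion.
    elev = np.arctan2(vz, np.sqrt(vx**2 + vy**2))

    # Quantise angles into histogram bins.
    a_bin = np.minimum((azim / (2 * np.pi) * n_azim).astype(int), n_azim - 1)
    e_bin = np.minimum(((elev + np.pi / 2) / np.pi * n_elev).astype(int),
                       n_elev - 1)

    # Spatial subregion index from pixel position (grid x grid cells).
    ys, xs = np.mgrid[0:H, 0:W]
    cell = (ys * grid // H) * grid + (xs * grid // W)

    # Accumulate motion magnitude into the joint histogram.
    hist = np.zeros((n_elev, grid * grid, n_azim))
    np.add.at(hist, (e_bin, cell, a_bin), mag)

    # L2 normalisation for (partial) invariance to overall motion scale.
    norm = np.linalg.norm(hist)
    return (hist / norm).ravel() if norm > 0 else hist.ravel()
```

With the default parameters this yields a 4 x 9 x 4 = 144-dimensional vector per sampled region; the same routine can be applied at sparse interest points or densely over the frame, as in the two visualisations above.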