Making Sense: Multitouchless Interfaces and iGroup

The Making Sense project was born out of an EPSRC sandpit. Its vision was to improve investigative capability by developing new tools to summarise, fuse, analyse and visualise large datasets. It is a large collaboration between the Universities of Surrey, Imperial, Leeds, Liverpool, Middlesex, Cranfield, Dundee, Southampton and Salford. Each institution brings a different skill set to the project, with expertise covering computer science, design, law and psychology.

Work within the project by the University of Surrey has developed data mining and machine learning tools that allow patterns to be discovered relating images, video and text. Combined with the multi-touchless interface, the system allows an analyst to quickly:

  1. Visualise the data.
  2. Summarise the data or subsets of the data.
  3. Discriminatively mine rules that separate one set of data from another.
  4. Project content into a visualisation space that reflects semantic similarity.



Taking inspiration from the film Minority Report, we created a curved projection screen which forms one of the main displays. The outputs of two HD projectors are pre-distorted to correct for the curvature of the screen and stitched together to form a single 2500x800 pixel display. As the user stands at a distance from the screen, conventional means of interaction such as a mouse and keyboard are not applicable. The system therefore employs multi-touchless technology, developed at CVSSP by Philip Krejov and Richard Bowden, that uses a Microsoft Kinect sensor to track the user's head, hands and fingertips.
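The article does not describe how the two projector images are joined, but a common approach for seamless stitching is linear edge blending across the overlap region. The sketch below is illustrative only: the per-projector width and 300-pixel overlap are assumptions; the source states only the combined 2500x800 resolution.

```python
import numpy as np

# Illustrative edge blending for two horizontally stitched projectors.
# PROJ_W and OVERLAP are assumed values chosen so 2*PROJ_W - OVERLAP == 2500.
TOTAL_W, PROJ_W, OVERLAP = 2500, 1400, 300

def blend_weights(total_w=TOTAL_W, proj_w=PROJ_W, overlap=OVERLAP):
    """Return per-column intensity weights (left, right) that sum to 1."""
    left = np.zeros(total_w)
    right = np.zeros(total_w)
    left[:proj_w] = 1.0
    right[total_w - proj_w:] = 1.0
    # Linear cross-fade inside the overlap so the seam is invisible.
    start = proj_w - overlap                 # first overlapped column
    ramp = np.linspace(1.0, 0.0, overlap)
    left[start:proj_w] = ramp
    right[start:proj_w] = 1.0 - ramp
    return left, right

left, right = blend_weights()
```

Because the two weight curves sum to one everywhere, the combined brightness stays constant across the seam; a real deployment would also apply per-projector gamma correction before blending.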

It can interpret gestures similar to those of a tablet or smartphone, without the user ever touching the screen. Finger interaction in this setting is difficult due to factors such as rapid motion and reduced spatial resolution. To overcome these challenges, a graph-based approach is used. This provides a fast and efficient means of localising fingertips, and as a result the approach can detect and track multiple users in real time, providing a means for collaborative input in the same workspace. This work is shortly to be published at the International Conference on Face and Gesture in Shanghai [4].
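The core idea of finding fingertips as geodesic maxima on a graph of hand pixels, in the spirit of [4], can be sketched as follows. The hand mask here is synthetic (a square "palm" with three rectangular "fingers"); the real system builds the graph from Kinect depth data, and the local-maximum test and threshold are simplified assumptions.

```python
from collections import deque

# Synthetic binary hand mask on a 21x21 grid: a 5x5 palm plus three fingers.
mask = set()
for r in range(8, 13):
    for c in range(8, 13):
        mask.add((r, c))          # palm centred at (10, 10)
for r in range(0, 8):
    mask.add((r, 10))             # finger pointing up
for r in range(13, 21):
    mask.add((r, 10))             # finger pointing down
for c in range(0, 8):
    mask.add((10, c))             # finger pointing left

def geodesic_fingertips(mask, root=(10, 10), rel_thresh=0.6):
    """BFS geodesic distance from the palm centre; fingertips are local
    maxima whose distance is a large fraction of the global maximum."""
    dist = {root: 0}
    q = deque([root])
    while q:
        r, c = q.popleft()
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb in mask and nb not in dist:
                dist[nb] = dist[(r, c)] + 1
                q.append(nb)
    d_max = max(dist.values())
    tips = []
    for (r, c), d in dist.items():
        nbs = [n for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
               if n in dist]
        if d >= rel_thresh * d_max and all(d > dist[n] for n in nbs):
            tips.append((r, c))
    return sorted(tips)

tips = geodesic_fingertips(mask)  # one geodesic maximum per synthetic finger
```

The relative threshold discards shallow local maxima on the palm boundary, leaving only the finger extremities.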


A second display, directly in front of the user, combines a ProMultis overlay with an HD TV from Sourcetech to form a more traditional multi-touch table, replacing the previous home-built table. This display provides the visualisation of the data in the semantic space output by the machine learning. Here, the distance between objects represents the semantic similarity of documents in the dataset, projected through multi-dimensional scaling into an interactive finite element simulation.

The system allows the user to select sets of data and perform multimedia data mining to extract rules that link the data together. These rules provide the semantic similarity that underpins the visualisation space.
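The projection step can be sketched with classical multi-dimensional scaling, which collapses a pairwise-distance matrix into 2-D coordinates for display. The four-point distance matrix below is synthetic; in the system the distances would come from the mined semantic similarity, and the article's interactive finite element simulation is a further layer not shown here.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed points so Euclidean distances approximate the matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]    # keep the largest eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

# Pairwise distances between the corners of a unit square.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

X = classical_mds(D)                         # 2-D layout coordinates
D_hat = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
```

For genuinely two-dimensional input the recovered pairwise distances match the originals exactly, up to rotation and reflection of the layout.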

Data collection

The topic of research was the analysis of hard drives. To generate a synthetic dataset, the psychologists at the University of Liverpool wrote scripts for four individuals, each with a different profile and interests. These contained normal online behaviours and atypical behaviours for each subject. Over the course of one month, these scripts were followed on a day-by-day basis, embedding the scripted behaviour within normal "filler" behaviour. This generated a hard drive for each subject containing text, images and videos from web pages, emails, YouTube videos, Flickr photos and Word documents; a truly mixed-mode dataset. The forensic scientists from Cranfield were then able to reconstruct this data from the hard drives and produce high-level event timelines of the user activity, along with supporting digital evidence from the hard drive images.

CVSSP has worked on image and video data mining for some time [2], and the system employs real-time tools developed by Andrew Gilbert and Richard Bowden at CVSSP to display and interact with the mixed-modality data.



The traditional method for clustering or grouping media is to use a large training set of labelled data. The Making Sense system moves away from large hand-labelled training datasets, instead allowing the user to find natural groups of similar content based upon a handful of "seed" examples, using two efficient data mining tools originally developed for text analysis: min-Hash and APriori. The user guides the grouping of the diverse media, e.g. web pages, video, images and emails, by identifying only a small subset of contradictory items [3]. The pair-wise similarity between the media items is then collapsed into a 2D representation for display on the touch table. More recent work has looked at automatically attaching linguistic tags to images harvested from the internet [1].
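The min-Hash idea behind this cheap similarity estimation can be sketched in a few lines. Each of k salted hashes keeps its minimum over a document's token set, and the fraction of matching minima between two signatures estimates their Jaccard similarity. The token sets below are invented for illustration; the real system applies this to mixed-modality features, not raw words.

```python
import hashlib

def _h(seed, token):
    """Deterministic 64-bit salted hash of a token (MD5 prefix)."""
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens, k=64):
    """One minimum per salted hash function; k values summarise the set."""
    return [min(_h(seed, t) for t in tokens) for seed in range(k)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature components estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = {"holiday", "beach", "flight", "hotel"}
doc_b = {"holiday", "beach", "flight", "museum"}
sim = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
# sim approximates the true Jaccard similarity |A ∩ B| / |A ∪ B| = 3/5
```

Because signatures are tiny and comparable component-wise, pairwise similarity over a large collection becomes cheap, which is what makes the interactive grouping feasible.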

In addition to showing groups of similar media content, a summary of the media can be produced. This efficiently identifies the frequently recurring elements within the data. When all the events are used, this provides an overall gist or flavour of the activities. Furthermore, discriminative mining can be used, where a subset of (positive) events is mined against another subset of (negative) examples. This highlights events or content that are salient in the positive set, allowing trends to be identified that, for example, describe activity particular to a specific time of day or day of the week.
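A minimal sketch of the discriminative idea: find items whose support among the positive events clearly exceeds their support among the negative ones. The toy weekday/weekend events and the support-gap threshold are invented for illustration; the project's actual mining operates on rules over mixed media, not single items.

```python
from collections import Counter

def salient_items(pos_events, neg_events, min_gap=0.5):
    """Items whose support in the positive events exceeds their support
    in the negative events by at least min_gap."""
    pos_n, neg_n = len(pos_events), len(neg_events)
    pos_cnt = Counter(item for event in pos_events for item in event)
    neg_cnt = Counter(item for event in neg_events for item in event)
    return {item for item, c in pos_cnt.items()
            if c / pos_n - neg_cnt[item] / neg_n >= min_gap}

# Toy event sets: each event is the set of activities seen in one session.
weekday = [{"email", "news"}, {"email", "maps"}, {"email", "news"}]
weekend = [{"news", "video"}, {"video", "maps"}]
# "email" occurs in every weekday event but in no weekend event,
# so it is flagged as salient for weekdays.
```

In the full system the same contrast would surface, for instance, activity patterns particular to a time of day rather than single activity labels.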


  1. Gilbert A, Bowden R, A Picture is Worth a Thousand Tags: Automatic Web Based Image Tag Expansion, In Proc. Asian Conference on Computer Vision (ACCV 2012), 2012.
  2. Gilbert A, Illingworth J, Bowden R, Action Recognition using Mined Hierarchical Compound Features, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol 33(5), pp 883-897, May 2011.
  3. Gilbert A, Bowden R, iGroup: Weakly supervised image and video grouping, In Proc. Int. Conference on Computer Vision (ICCV 2011), Barcelona, Nov 2011.
  4. Krejov P, Bowden R, Multitouchless: Real-Time Fingertip Detection and Tracking Using Geodesic Maxima, To appear in International Conference on Face and Gesture, Shanghai, China, April 2013.