Sign Language Production using Neural Machine Translation and Generative Adversarial Networks
by Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield and Richard Bowden
Abstract:
We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that depend on heavily annotated data, our approach requires minimal gloss- and skeletal-level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data-driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on the dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.
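To make the first sub-process concrete, below is a minimal sketch of a text-to-gloss encoder-decoder in PyTorch. It is an illustrative assumption, not the authors' implementation (for that, see the linked text2gloss repository): the GRU cells, greedy teacher-forced decoding, vocabulary sizes, and all hyperparameters here are hypothetical.

# Minimal sketch of the text-to-gloss stage: a GRU encoder-decoder
# mapping spoken-language token IDs to sign-gloss token IDs.
# All sizes and the architecture are illustrative assumptions; the
# authors' actual model is in the linked text2gloss repository.
import torch
import torch.nn as nn

class TextToGloss(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the spoken-language sentence into a context vector.
        _, ctx = self.encoder(self.src_emb(src_ids))
        # Teacher-forced decoding of the gloss sequence from that context.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), ctx)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

# Toy usage: a batch of 2 source sentences of length 7 yields
# per-step gloss logits for a target sequence of length 5.
model = TextToGloss(src_vocab=3000, tgt_vocab=1200)
src = torch.randint(0, 3000, (2, 7))
tgt = torch.randint(0, 1200, (2, 5))
print(model(src, tgt).shape)  # torch.Size([2, 5, 1200])

The predicted gloss sequence would then be mapped to skeletal pose sequences, which in turn condition the generative video model; those two later stages are not sketched here.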
Reference:
Sign Language Production using Neural Machine Translation and Generative Adversarial Networks (Stephanie Stoll, Necati Cihan Camgoz, Simon Hadfield and Richard Bowden), In Proceedings of the British Machine Vision Conference (BMVC), BMVA Press, 2018. (Oral, Recorded presentation, text2gloss code)
BibTeX Entry:
@InProceedings{Stoll18,
  Title                    = {Sign Language Production using Neural Machine Translation and Generative Adversarial Networks},
  Author                   = {Stephanie Stoll and Necati Cihan Camgoz and Simon Hadfield and Richard Bowden},
  Booktitle                = {Proceedings of the British Machine Vision Conference (BMVC)},
  Year                     = {2018},

  Address                  = {Newcastle, UK},
  Month                    = {3--6 } # sep,
  Publisher                = {BMVA Press},

  Abstract                 = {We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that depend on heavily annotated data, our approach requires minimal gloss- and skeletal-level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data-driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on the dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.},
  Comment                  = {<font color="red">Oral</font>, <a href="https://youtu.be/VisZLaZyblE?t=3457">Recorded presentation</a>, <a href="https://github.com/neccam/text2gloss">text2gloss code</a>},
  Crossref                 = {BMVC18},
  Url                      = {http://personalpages.surrey.ac.uk/s.hadfield/papers/Stoll18.pdf}
}