
Using AI to Help Improve Lip Reading


Artificial intelligence is becoming more common and pervasive. Find out the latest domain where it's making a difference.




AI has made impressive strides in understanding and processing speech in recent years. For instance, I wrote recently about a startup that is using machine learning to analyze our conversations for signs of neurological disorders such as Alzheimer's and Parkinson's.

Another fascinating project has emerged from researchers at Oxford University. They have developed a new system, called LipNet, which they believe is capable of lip reading considerably better than existing software.

Read My Lips

The software, which was documented in a recent paper, is claimed to be the most accurate yet at understanding what someone is saying just by tracking their lips. The team reports an accuracy of 93.4%, which is not only considerably better than other lip reading applications on the market but also far superior to the best human lip readers, who manage an accuracy of around 52%.

“Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end,” the authors write.

In other words, rather than classifying isolated words, the researchers trained the system on entire sentences, using that broader context together with deep learning to decipher individual words as the sentence unfolds.

The team believes that the service could eventually be offered as a smartphone app that helps the hearing impaired lip read more effectively.

It’s certainly a fascinating project and a further indication of the progress that’s being made.  Check out the video below to see LipNet in action.


