Disney Does Better Dubbing

More than 9,600 alternatives found for "clean swatches"


Photo: Getty Images

Bad dubbing on foreign films could one day be replaced with better lip-synched audio with the aid of software from Disney Research.

Speech redubbing, used for the translation of movies, television shows and video games into another language, or the removal of offensive language for television networks, usually involves careful scripting to choose words that match lip motions and a subsequent re-recording by actors. The weakness of dubbing lies in how easy it is to detect even subtle discrepancies between spoken words and facial motions. 

To overcome this challenge, scientists at Disney Research Pittsburgh and the University of East Anglia in England are developing automated video redubbing strategies that find plausible word sequences to match actors' speech motions. They relied on the extreme level of ambiguity inherent in reading lips to increase the number of word possibilities that dubbing could place in people's mouths.

The scientists will present their findings on 23 April at the IEEE International Conference on Acoustics, Speech and Signal Processing in Brisbane, Australia.

The scientists based their approach on "dynamic visemes," the facial movements associated with speech. They found that dynamic visemes could produce far more potential word sequences than approaches that rely on static visemes, the fixed lip shapes associated with individual speech sounds.

To develop their program, the researchers recorded an actor speaking more than 2,500 phonetically diverse sentences. This video was segmented into roughly 50,000 facial movements, which were clustered to form 150 classes of dynamic visemes. The software then analyzed the dynamic visemes associated with the video of any given utterance. The researchers found that the software, when used in combination with a pronunciation dictionary and a language model, could find a vast number of spoken sounds that were in sync with the original video.

For instance, the scientists found that the facial movements an actor makes when saying "clean swatches" are the same as those for more than 9,600 phrases, such as "yes rots," "likes swats," "need no pots," "then swine to," "tikes rush to," "like to watch you," "to lease to mine curse," "dicier mutts," "seed lumine anew," and "she is to a scissor." Although these alternative word sequences can vary in the number of syllables, they can still remain visually consistent with speech motions.
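To give a rough feel for how a single sequence of lip movements can fan out into many phrases, here is a toy sketch in Python. It is not the researchers' software. The three viseme classes, the phoneme strings attached to them, and the four-word pronunciation dictionary are all invented for illustration, and the language model that the article says is used alongside the dictionary is left out entirely. The sketch assumes each dynamic viseme class is visually consistent with several phoneme strings of different lengths, expands every combination, and keeps the ones that can be split into dictionary words.

```python
from itertools import product

# Hypothetical toy inventory (not from the research): each dynamic viseme
# class is a lip movement consistent with several phoneme strings.
VISEME_PHONES = {
    "A": [("k", "l"), ("y",)],
    "B": [("iy", "n"), ("eh", "s")],
    "C": [("r", "aa", "t", "s"), ("s", "w", "aa", "t", "s")],
}

# Hypothetical pronunciation dictionary: word -> phoneme tuple.
PRONUNCIATIONS = {
    "clean": ("k", "l", "iy", "n"),
    "yes": ("y", "eh", "s"),
    "rots": ("r", "aa", "t", "s"),
    "swats": ("s", "w", "aa", "t", "s"),
}


def segment(phones, lexicon):
    """Return every way to split a phoneme tuple into dictionary words."""
    if not phones:
        return [[]]
    results = []
    for word, pron in lexicon.items():
        # Try each word whose pronunciation is a prefix of what remains.
        if phones[:len(pron)] == pron:
            for rest in segment(phones[len(pron):], lexicon):
                results.append([word] + rest)
    return results


def alternatives(viseme_seq, viseme_phones, lexicon):
    """List phrases whose phonemes fit the given viseme sequence."""
    phrases = set()
    # Pick one phoneme string per viseme, in every possible combination.
    for choice in product(*(viseme_phones[v] for v in viseme_seq)):
        phones = tuple(p for chunk in choice for p in chunk)
        for words in segment(phones, lexicon):
            phrases.add(" ".join(words))
    return sorted(phrases)


print(alternatives(["A", "B", "C"], VISEME_PHONES, PRONUNCIATIONS))
```

With this made-up data, the sequence A, B, C yields four phrases: "clean rots," "clean swats," "yes rots," and "yes swats." Word boundaries do not have to line up with viseme boundaries ("yes" spans classes A and B here), which is one reason the number of candidates can grow so quickly with a realistic dictionary.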
