A group of academics at CVIT (Center for Visual Information Technology) in Hyderabad, India has developed a method, called Wav2Lip, for accurately lip-syncing dynamic, unconstrained talking-face videos using Generative Adversarial Networks (GANs). As you will see in the demo video, this includes matching a live language translation to an original video spoken in a native tongue. Wav2Lip works for any identity, voice, and language, including CGI faces and synthetic voices.
Until now, researchers have only been able to match spoken-word audio to accurate lip movements on a static image or on videos of specific people seen during the machine-learning training phase. An interactive demo lets you test Wav2Lip's accuracy with up to 20 seconds of video, and below is an example of the results that can be achieved: here, Smash Mouth's song 'All Star' is dubbed over clips from some well-known films.
This link will take you to a downloadable PDF of the team's academic paper. The open-sourced code, models, and evaluation benchmarks can be found on GitHub, and you are encouraged to use the team's work to conduct further research in this space.
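For readers who want to experiment, here is a minimal sketch of how the released inference script might be invoked from Python, assuming the flag names and default checkpoint path documented in the team's GitHub repository; the input file names are placeholders for your own video and audio.

import subprocess

# Sketch only: runs Wav2Lip's inference.py with the flags documented in its README.
# The checkpoint, face video, and driving audio paths below are placeholders.
subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained GAN checkpoint
        "--face", "my_video.mp4",      # the talking-face video to re-sync
        "--audio", "new_speech.wav",   # the speech to lip-sync to (any voice or language)
    ],
    check=True,
)
# By default, the dubbed result is written to results/result_voice.mp4.

This simply wraps the repository's own command-line interface, so any GPU, dependency, or checkpoint setup described in the project's documentation still applies.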