EMO can also animate singing portraits, synchronizing mouth shapes and facial expressions to the vocals. The system outperformed existing methods on metrics measuring video quality, identity preservation, and expressiveness. However, there are ethical concerns about potential misuse of such technology to impersonate people without consent or to spread misinformation. The researchers plan to explore methods for detecting synthetic video.
Key takeaways:
- Researchers at Alibaba's Institute for Intelligent Computing have developed an AI system called EMO that can animate a single portrait photo and generate videos of the person talking or singing.
- EMO uses a diffusion model to convert the audio waveform directly into video frames, capturing subtle motions and identity-specific quirks associated with natural speech.
- EMO can also animate singing portraits with appropriate mouth shapes and facial expressions synchronized to the vocals, and it significantly outperforms existing methods on metrics measuring video quality, identity preservation, and expressiveness.
- Despite the promising results, ethical concerns remain about potential misuse of such technology to impersonate people without consent or spread misinformation, and the researchers plan to explore methods to detect synthetic video.
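The core idea above — a diffusion model that denoises video frames while conditioning on audio — can be illustrated with a toy sketch. This is not EMO's actual architecture (the paper's networks, feature extractors, and noise schedule are not described here); the audio encoder and denoiser below are hypothetical stand-ins that only show the general shape of audio-conditioned diffusion sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_features(waveform: np.ndarray, n_feats: int = 8) -> np.ndarray:
    """Stand-in audio encoder: split the waveform into chunks and
    summarize each chunk by its energy (mean squared amplitude)."""
    chunks = np.array_split(waveform, n_feats)
    return np.array([np.mean(c ** 2) for c in chunks])

def denoise_step(frame: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """Stand-in denoiser: nudge the noisy frame toward an
    audio-conditioned target, more strongly at later steps."""
    target = np.outer(cond, cond)[: frame.shape[0], : frame.shape[1]]
    return frame + t * (target - frame)

def sample_frame(waveform: np.ndarray, steps: int = 10, size: int = 8) -> np.ndarray:
    """Run the reverse-diffusion loop: start from pure noise and
    iteratively denoise, conditioning every step on the audio."""
    cond = audio_features(waveform, n_feats=size)
    frame = rng.standard_normal((size, size))  # pure Gaussian noise
    for i in range(steps, 0, -1):
        frame = denoise_step(frame, cond, t=1.0 / i)
    return frame

frame = sample_frame(rng.standard_normal(1600))
print(frame.shape)  # one "video frame" per sampling run: (8, 8)
```

In a real pipeline each sampled frame would additionally condition on the reference portrait and on previously generated frames to preserve identity and temporal coherence; here the loop only shows how the audio signal steers every denoising step.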