EMO can also animate singing portraits, synchronizing mouth shapes and facial expressions to the vocals. The system outperformed existing methods on metrics measuring video quality, identity preservation, and expressiveness. However, there are ethical concerns about potential misuse of such technology to impersonate people without consent or to spread misinformation. The researchers plan to explore methods for detecting synthetic video.
Key takeaways:
- Researchers at Alibaba's Institute for Intelligent Computing have developed an AI system called EMO that can animate a single portrait photo and generate videos of the person talking or singing.
- EMO uses a diffusion model to convert the audio waveform directly into video frames, capturing subtle motions and identity-specific quirks associated with natural speech.
- EMO can also animate singing portraits with appropriate mouth shapes and facial expressions synchronized to the vocals, and it significantly outperforms existing methods on metrics measuring video quality, identity preservation, and expressiveness.
- Despite the promising results, ethical concerns remain about potential misuse of such technology to impersonate people without consent or spread misinformation, and the researchers plan to explore methods to detect synthetic video.
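The core idea above — a diffusion model that denoises video frames while conditioning on audio — can be illustrated with a toy sketch. This is not EMO's actual architecture (the paper's networks, feature extractors, and noise schedule are not described here); the audio encoder and denoiser below are hypothetical stand-ins that only show the general shape of audio-conditioned diffusion sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def audio_features(waveform: np.ndarray, n_feats: int = 8) -> np.ndarray:
    """Stand-in audio encoder: split the waveform into chunks and
    summarize each chunk by its energy (mean squared amplitude)."""
    chunks = np.array_split(waveform, n_feats)
    return np.array([np.mean(c ** 2) for c in chunks])

def denoise_step(frame: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """Stand-in denoiser: nudge the noisy frame toward an
    audio-conditioned target, more strongly at later steps."""
    target = np.outer(cond, cond)[: frame.shape[0], : frame.shape[1]]
    return frame + t * (target - frame)

def sample_frame(waveform: np.ndarray, steps: int = 10, size: int = 8) -> np.ndarray:
    """Run the reverse-diffusion loop: start from pure noise and
    iteratively denoise, conditioning every step on the audio."""
    cond = audio_features(waveform, n_feats=size)
    frame = rng.standard_normal((size, size))  # pure Gaussian noise
    for i in range(steps, 0, -1):
        frame = denoise_step(frame, cond, t=1.0 / i)
    return frame

frame = sample_frame(rng.standard_normal(1600))
print(frame.shape)  # one "video frame" per sampling run: (8, 8)
```

In a real pipeline each sampled frame would additionally condition on the reference portrait and on previously generated frames to preserve identity and temporal coherence; here the loop only shows how the audio signal steers every denoising step.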