VASA-1 could be used for advanced lip-syncing in games, virtual avatars for social media videos, and AI-based moviemaking. Despite that potential, the team has no plans for a public release or for making the model available to developers. VASA-1 can lip-sync convincingly to a song, handle a range of image styles, and generate 512x512-pixel video at 45 frames per second in about 2 minutes on a desktop-grade Nvidia RTX 4090 GPU.
Key takeaways:
- Microsoft's new AI research paper introduces VASA-1, a model that can create a hyper-realistic talking-face video from a single portrait photo and an audio file.
- The technology is currently available only to the Microsoft Research team, but the demo videos show impressive lip sync, realistic facial features, and natural head movement.
- Potential applications of VASA-1 include advanced lip-syncing in games, AI-driven NPCs with natural lip movement, and virtual avatars for social media videos.
- Despite its potential, the team has stated that this is just a research demonstration, with no plans for a public release or for making it available to developers for use in products.