Emu Video uses diffusion models and is initialized from a state-of-the-art text-to-image model pre-trained on large datasets of image-text pairs. The architecture is a U-Net with an embedded CLIP text encoder, extended with 1D temporal convolution and attention layers after each spatial component to model dependencies across time. In human evaluations, Emu Video outperformed prior state-of-the-art text-to-video methods, producing results with higher visual quality, finer detail, smoother motion, fewer artifacts, and greater temporal consistency across frames.
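To make the layer arrangement concrete, here is a minimal PyTorch sketch of the general "spatial block followed by temporal layers" pattern described above; the module names, channel counts, and layer details are illustrative assumptions, not Meta's implementation.

```python
# Minimal sketch (not Meta's released code): a pretrained 2D spatial block is
# followed by a 1D temporal convolution and a temporal attention layer that mix
# information across the frame axis. Sizes and names are assumptions.
import torch
import torch.nn as nn


class TemporalConv1d(nn.Module):
    """1D convolution over the time axis of a (B, C, T, H, W) tensor."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Fold the spatial dims into the batch so the conv only sees time.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        x = self.conv(x)
        return x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


class TemporalAttention(nn.Module):
    """Self-attention across frames, applied independently at each spatial location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)  # sequence axis = time
        x = x + self.attn(x, x, x, need_weights=False)[0]      # residual connection
        return x.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)


class SpatioTemporalBlock(nn.Module):
    """A pretrained spatial block followed by newly added temporal layers."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block  # e.g. a 2D block taken from the text-to-image U-Net
        self.temporal_conv = TemporalConv1d(channels)
        self.temporal_attn = TemporalAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Apply the pretrained spatial layer to each frame independently.
        frames = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        x = frames.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # Then mix information across frames with the new temporal layers.
        x = self.temporal_conv(x)
        x = self.temporal_attn(x)
        return x


if __name__ == "__main__":
    block = SpatioTemporalBlock(nn.Conv2d(64, 64, 3, padding=1), channels=64)
    video_latents = torch.randn(2, 64, 8, 32, 32)  # (batch, channels, frames, H, W)
    print(block(video_latents).shape)  # torch.Size([2, 64, 8, 32, 32])
```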
Key takeaways:
- Researchers from Meta have introduced a new text-to-video generation approach called Emu Video, which they claim sets a new standard in the field. The approach factorizes generation into two steps: first generating a high-quality image from the text prompt, then generating a video conditioned on both that image and the original prompt (a minimal sketch of this two-step pipeline appears after the list).
- Emu Video builds on diffusion models and is initialized from a state-of-the-art text-to-image model pre-trained on large datasets of image-text pairs. The video generation model is then trained with a two-step generation process that mirrors this factorization.
- Two key innovations in Emu Video's training scheme enable directly generating high-resolution 512px videos: a zero terminal SNR noise schedule, which improves the stability of generated videos, and multi-stage training, which yields better final results (a sketch of the noise-schedule rescaling appears after the list).
- In human evaluations, Emu Video substantially outperforms prior state-of-the-art text-to-video methods, including recent research systems and commercial solutions. The generated videos better reflect the semantic content of the text prompts in both spatial layout and temporal dynamics.
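To make the factorized pipeline from the first takeaway concrete, here is a minimal structural sketch of the two-step generation flow; `text_to_image` and `image_and_text_to_video` are hypothetical stand-ins for the two diffusion models, not a real Emu Video API.

```python
# Structural sketch of factorized text-to-video generation (image first, then motion).
from typing import Any, Callable, List


def generate_video(
    prompt: str,
    text_to_image: Callable[[str], Any],
    image_and_text_to_video: Callable[[Any, str], List[Any]],
) -> List[Any]:
    """Factorized text-to-video generation as described in the takeaways."""
    # Step 1: sample one high-quality image conditioned only on the prompt.
    first_frame = text_to_image(prompt)
    # Step 2: sample the video conditioned on both the image and the prompt,
    # so the second model mainly has to add plausible motion.
    return image_and_text_to_video(first_frame, prompt)


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end; in practice these would be
    # the pretrained text-to-image and image+text-to-video diffusion models.
    frames = generate_video(
        "a robot surfing a wave",
        text_to_image=lambda p: f"<image for: {p}>",
        image_and_text_to_video=lambda img, p: [img] * 16,
    )
    print(len(frames))  # 16
```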
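The zero terminal SNR point can be illustrated with the standard schedule-rescaling trick: shift and scale the cumulative signal level so the final diffusion timestep carries no signal at all, removing the mismatch between training (where some signal always remains) and sampling (which starts from pure noise). Whether Emu Video uses exactly this recipe is an assumption here; the sketch below shows the general technique.

```python
# Minimal sketch of rescaling a diffusion noise schedule to zero terminal SNR.
# Treating this as Emu Video's exact recipe is an assumption; it shows the
# commonly used technique behind "zero terminal SNR" schedules.
import torch


def rescale_to_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Shift and scale sqrt(alpha_bar) so the final step has exactly zero SNR."""
    alphas = 1.0 - betas
    alphas_bar_sqrt = torch.cumprod(alphas, dim=0).sqrt()

    a_first = alphas_bar_sqrt[0].clone()
    a_last = alphas_bar_sqrt[-1].clone()

    # Shift the last value to 0, then rescale so the first value is unchanged.
    alphas_bar_sqrt = (alphas_bar_sqrt - a_last) * a_first / (a_first - a_last)

    # Convert back to per-step betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas


if __name__ == "__main__":
    betas = torch.linspace(1e-4, 0.02, 1000)  # a typical linear schedule
    new_betas = rescale_to_zero_terminal_snr(betas)
    print(torch.cumprod(1 - new_betas, dim=0)[-1])  # ~0: terminal SNR is now zero
```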