Meta figured out how to make better videos by turning text into images first

Nov 18, 2023 - aimodels.substack.com
The article discusses Emu Video, a new text-to-video generation approach from researchers at Meta. Emu Video is designed to generate high-quality, temporally consistent videos from a text prompt, in two steps: first it synthesizes a high-quality image from the text prompt, then it generates the full video conditioned on both the synthesized image and the original prompt. This factorization lets the video model focus on inferring the motions and transformations to apply to an already-correct image, rather than having to identify the right visual concepts and their relations from scratch.
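The two-step factorization can be sketched as a simple pipeline. Everything below is an illustrative stand-in, assuming hypothetical `generate_image` / `generate_video` functions; it is not Meta's actual API or model code.

```python
# Illustrative sketch of Emu Video's two-step factorization.
# generate_image / generate_video are hypothetical stand-ins for the
# real diffusion models, not Meta's code.

def generate_image(prompt):
    """Step 1 stand-in: text -> high-quality image."""
    return {"prompt": prompt, "pixels": [[0.0] * 4 for _ in range(4)]}

def generate_video(first_image, prompt, num_frames=8):
    """Step 2 stand-in: (image, text) -> full video.

    Every frame is conditioned on the same synthesized first image
    and the original text prompt."""
    return [{"frame": i, "prompt": prompt, "conditioning": first_image}
            for i in range(num_frames)]

def emu_video_pipeline(prompt, num_frames=8):
    image = generate_image(prompt)                    # step 1: text -> image
    return generate_video(image, prompt, num_frames)  # step 2: (image, text) -> video
```

The key structural point the sketch captures is that step 2 never sees raw text alone: it always receives a concrete first frame, so it only has to model motion.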

Emu Video uses diffusion models and is initialized using a state-of-the-art text-to-image model pre-trained on large datasets of image-text pairs. The model architecture uses a U-Net with an embedded CLIP text encoder and adds 1D temporal convolutions and attention layers after each spatial component to model time dependencies. In human evaluations, Emu Video outperformed prior state-of-the-art methods for text-to-video generation, producing higher visual quality results with finer details, better motion smoothness, fewer artifacts, and greater temporal consistency across frames.
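The factorized space-time design described above, with spatial layers applied per frame followed by 1D temporal layers applied across frames, can be illustrated with plain-Python stand-ins. The block names and the toy operations inside them are illustrative assumptions, not the actual U-Net layers.

```python
def spatial_block(frame):
    # stand-in for a spatial conv/attention layer acting on one frame
    return [x + 1.0 for x in frame]

def temporal_block(pixel_series):
    # stand-in for a 1D temporal conv/attention layer acting on the
    # values of one pixel position across all frames (here: a mean,
    # i.e. a crude temporal smoothing)
    n = len(pixel_series)
    return [sum(pixel_series) / n for _ in pixel_series]

def factorized_spacetime_layer(video):
    """video: list of T frames, each a flat list of P pixel values."""
    # 1) spatial pass: each frame is processed independently
    video = [spatial_block(f) for f in video]
    # 2) temporal pass: each pixel position is processed across frames
    T, P = len(video), len(video[0])
    cols = [temporal_block([video[t][p] for t in range(T)]) for p in range(P)]
    return [[cols[p][t] for p in range(P)] for t in range(T)]
```

This interleaving is what lets a pre-trained image model be extended to video: the spatial blocks keep their image-pretrained weights, while the new temporal blocks only have to learn dependencies along the time axis.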

Key takeaways:

  • Researchers from Meta have introduced a new text-to-video generation approach called Emu Video, which they claim sets a new standard in the field. The approach involves generating a high-quality image based on the text prompt, and then generating a video based on both the image and the original text prompt.
  • Emu Video uses diffusion models and is initialized from a state-of-the-art text-to-image model pre-trained on large datasets of image-text pairs. The video model is trained to generate frames conditioned on a starting image and the text prompt, mirroring the two-step factorization used at inference time.
  • Two key innovations in Emu Video's training scheme enable directly generating high-resolution 512px videos: Zero Terminal SNR Noise Schedule, which improves video stability, and Multi-Stage Training, which leads to better final results.
  • In human evaluations, Emu Video substantially outperforms prior state-of-the-art methods for text-to-video generation, including recent publications and commercial solutions. The generated videos better reflect the semantic content of the text prompts in both the spatial layout and temporal dynamics.
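The zero terminal SNR fix mentioned above addresses a known flaw in standard diffusion noise schedules: the last training timestep still retains a little signal, so the model never learns to generate from pure noise. One published recipe (Lin et al., 2023, "Common Diffusion Noise Schedules and Sample Steps Are Flawed") rescales the cumulative schedule so the final step carries exactly zero signal. A minimal sketch, not necessarily the exact schedule Emu Video uses:

```python
import math

def rescale_zero_terminal_snr(betas):
    """Rescale a diffusion beta schedule so the final timestep has zero SNR.

    Recipe from Lin et al. (2023); illustrative, not Meta's exact code."""
    # cumulative product of alphas (fraction of signal retained per step)
    abar, p = [], 1.0
    for b in betas:
        p *= 1.0 - b
        abar.append(p)
    s = [math.sqrt(x) for x in abar]  # sqrt(alpha_bar), i.e. the signal scale
    s0, sT = s[0], s[-1]
    # shift so the last step has zero signal, then rescale so the
    # first step is unchanged
    s = [(x - sT) * s0 / (s0 - sT) for x in s]
    abar = [x * x for x in s]
    # recover per-step alphas, then betas; the final beta becomes 1.0,
    # so training must use an SNR=0-safe target such as v-prediction
    alphas = [abar[0]] + [abar[i] / abar[i - 1] for i in range(1, len(abar))]
    return [1.0 - a for a in alphas]
```

After rescaling, the last timestep is pure noise at train time, matching what the sampler actually starts from at inference, which is why this improves stability for high-resolution generation.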
