Factorial Funds | Under The Hood: How OpenAI's Sora Model Works

Mar 20, 2024 - factorialfunds.com
OpenAI's Sora model is a diffusion model that generates highly realistic videos from text descriptions. Built on Diffusion Transformers (DiT) and Latent Diffusion, it requires substantial compute to train, estimated at 4,200-10,500 Nvidia H100 GPUs running for one month. For inference, Sora can generate about 5 minutes of video per hour per Nvidia H100 GPU, so as Sora-like models are widely deployed, inference compute will come to dominate training compute. The "break-even point" is estimated at 15.3-38.1 million minutes of generated video, after which more cumulative compute has been spent on inference than on the original training.
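For a sense of where the quoted break-even range comes from, here is a minimal sketch of the arithmetic in Python. The GPU counts and the 5-minutes-of-video-per-H100-hour throughput are the article's figures; the 30.4-day month used to convert GPU-months into GPU-hours is an assumption made here, which is why the upper bound lands at roughly 38.3 million minutes rather than the article's 38.1 million.

```python
# Sketch of the break-even arithmetic described above. The GPU counts and
# the 5 min/GPU-hour inference throughput are the article's numbers; the
# 30.4-day month used to convert GPU-months to GPU-hours is an assumption.

HOURS_PER_MONTH = 30.4 * 24  # assumed average month length, in hours

def break_even_minutes(training_gpus: float,
                       training_months: float = 1.0,
                       minutes_per_gpu_hour: float = 5.0) -> float:
    """Minutes of generated video at which cumulative inference compute
    (in H100-GPU-hours) equals the compute spent on training."""
    training_gpu_hours = training_gpus * training_months * HOURS_PER_MONTH
    # One minute of video costs 1 / minutes_per_gpu_hour GPU-hours, so
    # break-even = training_gpu_hours / (1 / minutes_per_gpu_hour).
    return training_gpu_hours * minutes_per_gpu_hour

low = break_even_minutes(4_200)    # ~15.3 million minutes
high = break_even_minutes(10_500)  # ~38.3 million minutes
print(f"break-even range: {low / 1e6:.1f}M to {high / 1e6:.1f}M minutes")
```

The key intuition: training is a one-time cost in GPU-hours, while inference cost scales linearly with minutes generated, so the break-even point is simply the training budget divided by the per-minute inference cost.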

The Sora model has implications for video generation, synthetic data generation, data augmentation, and simulation. It could replace some uses of stock video footage and be used to generate fully synthetic training data, and the quality and detail of its output suggest it is approaching real-world applicability. Remaining challenges include the difficulty of editing generated videos and the need for intuitive user interfaces and workflows.

Key takeaways:

  • OpenAI's Sora model is a diffusion model that can generate highly realistic videos, demonstrating that scaling up video models is worthwhile and can lead to rapid improvements.
  • Companies like Runway, Genmo, and Pika are building intuitive interfaces and workflows around video generation models like Sora; these will largely determine how usable and useful the models are in practice.
  • Training Sora is estimated to require 4,200-10,500 Nvidia H100 GPUs running for one month; at inference, it generates about 5 minutes of video per hour per Nvidia H100 GPU.
  • As Sora-like models are widely deployed, inference compute will come to dominate training compute; the "break-even point" is estimated at 15.3-38.1 million minutes of generated video, after which more compute has been spent on inference than on the original training.