Stable Diffusion 3.0 uses diffusion transformers and flow matching for image generation, replacing the commonly used U-Net backbone with a transformer that operates on patches of the latent image. According to Stability AI, this approach makes more efficient use of compute and outperforms other approaches to diffusion-based image generation. The model also benefits from flow matching, a newer method for training Continuous Normalizing Flows (CNFs) to model complex data distributions.
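Flow matching trains a network to predict the velocity field that carries noise samples to data samples along simple probability paths. As a minimal sketch only (assuming a rectified-flow-style straight-line path; `velocity_model` is a hypothetical stand-in network, not Stability AI's actual implementation), one training step might look like this:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, x1):
    """One conditional flow-matching training step.

    x1: a batch of data samples (e.g. latent images), shape (B, ...).
    velocity_model: hypothetical network taking (x_t, t), predicting velocity.
    """
    x0 = torch.randn_like(x1)                  # noise drawn from the prior
    # Random time in [0, 1], shaped to broadcast over the sample dimensions.
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                 # point on the straight-line path
    target = x1 - x0                           # constant velocity of that path
    return F.mse_loss(velocity_model(xt, t), target)
```

Because the interpolation path is a straight line, the regression target is just the constant difference between data and noise, which is what makes this objective simple to train compared with simulating a full CNF.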
Key takeaways:
- Stability AI is previewing Stable Diffusion 3.0, its next-generation flagship text-to-image generative AI model, which aims to deliver improved image quality and better performance when generating images from multi-subject prompts.
- The new model will also offer significantly better typography than prior Stable Diffusion models, enabling more accurate and consistent spelling within generated images.
- Stable Diffusion 3.0 is based on a new architecture, the diffusion transformer, similar to the one used in OpenAI's recent Sora model; a sketch of the patch-based input this architecture consumes appears after this list.
- Stability AI is also working on 3D image generation and video generation capabilities, with the aim of creating open models that can be used anywhere and adapted to any need.
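To feed an image to a transformer, a diffusion transformer first splits the latent image into non-overlapping patches and flattens each patch into a token. Below is a minimal sketch of that patchify step (the patch size of 2 is illustrative, not a confirmed Stable Diffusion 3.0 setting):

```python
import torch

def patchify(latents: torch.Tensor, patch_size: int = 2) -> torch.Tensor:
    """Turn a latent image (B, C, H, W) into a token sequence (B, N, C*p*p)
    that a transformer can attend over; N = (H/p) * (W/p) patches."""
    B, C, H, W = latents.shape
    p = patch_size
    x = latents.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)    # group each patch's values together
    return x.reshape(B, (H // p) * (W // p), C * p * p)

# Example: a 4-channel 64x64 latent becomes 1024 tokens of dimension 16.
tokens = patchify(torch.randn(1, 4, 64, 64))
print(tokens.shape)  # torch.Size([1, 1024, 16])
```

Treating the latent image as a token sequence is what lets the model reuse standard transformer scaling behavior in place of a U-Net's convolutional hierarchy.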