
Chameleon: Mixed-Modal Early-Fusion Foundation Models

May 21, 2024 - news.bensbites.com
The paper introduces Chameleon, a new family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any sequence. The model uses a stable training approach, an alignment recipe, and an architectural parameterization specifically designed for the early-fusion, token-based, mixed-modal setting. It has been evaluated on a wide range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
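
To make the "early-fusion, token-based" framing concrete, here is a minimal sketch of how such a model might assemble its input: images are quantized into discrete codebook indices and interleaved with BPE text tokens in a single flat sequence that one transformer processes end to end. The tokenizer stand-ins, vocabulary sizes, sentinel ids, and function names below are illustrative assumptions, not Chameleon's actual implementation or API.

```python
from typing import List, Sequence

# Hypothetical vocabulary layout; the real model's sizes and special tokens may differ.
TEXT_VOCAB_SIZE = 50_000
IMG_CODEBOOK_SIZE = 8_192
IMG_START_ID = TEXT_VOCAB_SIZE + IMG_CODEBOOK_SIZE  # begin-of-image sentinel
IMG_END_ID = IMG_START_ID + 1                       # end-of-image sentinel


def encode_text(text: str) -> List[int]:
    """Toy stand-in for a BPE text tokenizer."""
    return [hash(tok) % TEXT_VOCAB_SIZE for tok in text.split()]


def encode_image(image: object) -> List[int]:
    """Toy stand-in for a learned image tokenizer that quantizes an image
    into a fixed-length list of discrete codebook indices."""
    return [0] * 1024  # placeholder: a fixed number of codes per image


def build_mixed_modal_sequence(segments: Sequence[object]) -> List[int]:
    """Interleave text tokens and image codes into one flat token sequence,
    so a single transformer attends over both modalities (early fusion)."""
    seq: List[int] = []
    for seg in segments:
        if isinstance(seg, str):
            seq.extend(encode_text(seg))
        else:
            # Image codes are shifted past the text vocabulary so both
            # modalities share a single embedding table.
            seq.append(IMG_START_ID)
            seq.extend(TEXT_VOCAB_SIZE + code for code in encode_image(seg))
            seq.append(IMG_END_ID)
    return seq


# Example: a caption, an image, then a follow-up question, all in one sequence.
tokens = build_mixed_modal_sequence(
    ["A photo of a chameleon:", object(), "What color is it?"]
)
print(len(tokens))  # text tokens + image codes + two sentinels
```

Because generation happens over this same unified token stream, the model can emit text tokens, image tokens, or both in any order, which is what enables the long-form mixed-modal generation discussed below.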

Chameleon demonstrates broad capabilities: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini Pro, and performs non-trivial image generation. Furthermore, it matches or exceeds the performance of larger models like Gemini Pro and GPT-4V, based on human judgments on a new long-form mixed-modal generation evaluation. This marks a significant advancement in the unified modeling of full multimodal documents.

Key takeaways:

  • The paper introduces Chameleon, a family of early-fusion token-based mixed-modal models that can understand and generate images and text in any sequence.
  • Chameleon has been evaluated on a wide range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
  • Chameleon demonstrates broad capabilities, achieving state-of-the-art results in image captioning, outperforming Llama-2 on text-only tasks, and performing non-trivial image generation.
  • The model matches or exceeds the performance of larger models like Gemini Pro and GPT-4V, marking a significant advancement in the unified modeling of full multimodal documents.