
Chameleon: Mixed-Modal Early-Fusion Foundation Models

May 21, 2024 - news.bensbites.com
The paper introduces Chameleon, a new family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any sequence. The model uses a stable training approach, an alignment recipe, and an architectural parameterization specifically designed for the early-fusion, token-based, mixed-modal setting. It has been evaluated on a wide range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
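
To make the "early-fusion, token-based" framing concrete, here is a minimal sketch of how such a model might assemble its input: images are quantized into discrete codebook indices and interleaved with BPE text tokens in a single flat sequence that one transformer processes end to end. The tokenizer stand-ins, vocabulary sizes, sentinel ids, and function names below are illustrative assumptions, not Chameleon's actual implementation or API.

```python
from typing import List, Sequence

# Hypothetical vocabulary layout; the real model's sizes and special tokens may differ.
TEXT_VOCAB_SIZE = 50_000
IMG_CODEBOOK_SIZE = 8_192
IMG_START_ID = TEXT_VOCAB_SIZE + IMG_CODEBOOK_SIZE  # begin-of-image sentinel
IMG_END_ID = IMG_START_ID + 1                       # end-of-image sentinel


def encode_text(text: str) -> List[int]:
    """Toy stand-in for a BPE text tokenizer."""
    return [hash(tok) % TEXT_VOCAB_SIZE for tok in text.split()]


def encode_image(image: object) -> List[int]:
    """Toy stand-in for a learned image tokenizer that quantizes an image
    into a fixed-length list of discrete codebook indices."""
    return [0] * 1024  # placeholder: a fixed number of codes per image


def build_mixed_modal_sequence(segments: Sequence[object]) -> List[int]:
    """Interleave text tokens and image codes into one flat token sequence,
    so a single transformer attends over both modalities (early fusion)."""
    seq: List[int] = []
    for seg in segments:
        if isinstance(seg, str):
            seq.extend(encode_text(seg))
        else:
            # Image codes are shifted past the text vocabulary so both
            # modalities share a single embedding table.
            seq.append(IMG_START_ID)
            seq.extend(TEXT_VOCAB_SIZE + code for code in encode_image(seg))
            seq.append(IMG_END_ID)
    return seq


# Example: a caption, an image, then a follow-up question, all in one sequence.
tokens = build_mixed_modal_sequence(
    ["A photo of a chameleon:", object(), "What color is it?"]
)
print(len(tokens))  # text tokens + image codes + two sentinels
```

Because generation happens over this same unified token stream, the model can emit text tokens, image tokens, or both in any order, which is what enables the long-form mixed-modal generation discussed below.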

Chameleon demonstrates broad capabilities: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini Pro, and performs non-trivial image generation. Furthermore, it matches or exceeds the performance of larger models like Gemini Pro and GPT-4V, based on human judgments on a new long-form mixed-modal generation evaluation. This marks a significant advancement in the unified modeling of full multimodal documents.

Key takeaways:

  • The paper introduces Chameleon, a family of early-fusion token-based mixed-modal models that can understand and generate images and text in any sequence.
  • Chameleon has been evaluated on a wide range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
  • Chameleon demonstrates broad capabilities, achieving state-of-the-art results in image captioning, outperforming Llama-2 on text-only tasks, and performing non-trivial image generation.
  • The model matches or exceeds the performance of larger models like Gemini Pro and GPT-4V, marking a significant advancement in the unified modeling of full multimodal documents.