Chameleon has demonstrated broad capabilities: state-of-the-art performance on image captioning tasks, outperforming Llama-2 on text-only tasks while remaining competitive with models like Mixtral 8x7B and Gemini-Pro, and the ability to perform non-trivial image generation. Based on human judgments on a new long-form mixed-modal generation evaluation, it also matches or exceeds the performance of much larger models such as Gemini Pro and GPT-4V. This marks a significant advance in the unified modeling of full multimodal documents.
Key takeaways:
- The paper introduces Chameleon, a family of early-fusion token-based mixed-modal models that can understand and generate images and text in any sequence.
- Chameleon has been evaluated on a wide range of tasks including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation.
- Chameleon demonstrates broad capabilities, outperforming other models on image captioning and text-only tasks while also being capable of non-trivial image generation.
- The model matches or exceeds the performance of larger models like Gemini Pro and GPT-4V, marking a significant advancement in the unified modeling of full multimodal documents.