The authors used these findings to build MM1, a family of multimodal models with up to 30 billion parameters, spanning both dense models and mixture-of-experts (MoE) variants. These models achieve state-of-the-art pre-training metrics and perform competitively after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 exhibits desirable properties such as enhanced in-context learning and multi-image reasoning, which enable few-shot chain-of-thought prompting.
Key takeaways:
- The study focuses on building performant Multimodal Large Language Models (MLLMs) and analyzes, through careful ablations, which architecture components and data choices matter most.
- For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks (a weighted data-mixing sketch follows this list).
- The image encoder, the image resolution, and the number of image tokens all have a substantial impact on the model's performance, while the design of the vision-language connector is of comparatively negligible importance (a minimal connector sketch also appears after the list).
- The authors built MM1, a family of multimodal models up to 30B parameters, that are state-of-the-art in pre-training metrics and achieve competitive performance on a range of established multimodal benchmarks.
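To make the data-mix takeaway concrete, the sketch below samples training examples from caption, interleaved, and text-only sources according to fixed weights. The source names, the toy examples, and the 45/45/10 split are illustrative assumptions rather than MM1's exact configuration.

```python
import random
from typing import Dict, Iterator, List, Tuple


def mixed_data_stream(
    sources: Dict[str, Iterator[str]],
    weights: Dict[str, float],
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, example) pairs, picking a source per step by weight.

    `sources` maps a source name (e.g. "caption", "interleaved", "text_only")
    to an iterator over examples; `weights` gives each source's sampling
    probability in the mixture.
    """
    rng = random.Random(seed)
    names: List[str] = list(sources)
    probs = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(sources[name])


# Illustrative mixture: mostly caption and interleaved data, a small text-only share.
example_sources = {
    "caption": iter(["<img> a photo of a cat"] * 1000),
    "interleaved": iter(["page text <img> more text <img> ..."] * 1000),
    "text_only": iter(["plain language-modeling text"] * 1000),
}
example_weights = {"caption": 0.45, "interleaved": 0.45, "text_only": 0.10}

stream = mixed_data_stream(example_sources, example_weights)
for _ in range(3):
    print(next(stream))
```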
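The finding that connector design matters comparatively little can be illustrated with an intentionally simple one. The sketch below (PyTorch, with hypothetical dimensions) pools vision-encoder tokens down to a fixed count and linearly projects them into the language model's embedding space; MM1's actual connector and dimensions may differ.

```python
import torch
import torch.nn as nn


class PoolingConnector(nn.Module):
    """Reduce vision tokens via adaptive average pooling, then project them
    into the language model's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, num_image_tokens: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, seq_len, vision_dim) from the image encoder
        x = vision_tokens.transpose(1, 2)   # (batch, vision_dim, seq_len)
        x = self.pool(x).transpose(1, 2)    # (batch, num_image_tokens, vision_dim)
        return self.proj(x)                 # (batch, num_image_tokens, llm_dim)


# Example: 576 patch tokens from a ViT reduced to 144 image tokens for the LLM.
connector = PoolingConnector(vision_dim=1024, llm_dim=4096, num_image_tokens=144)
image_tokens = torch.randn(2, 576, 1024)
print(connector(image_tokens).shape)  # torch.Size([2, 144, 4096])
```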