MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Mar 16, 2024 - news.bensbites.co
This work examines how to build performant Multimodal Large Language Models (MLLMs), focusing on the significance of various architectural components and data choices. The authors conducted extensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, identifying several key design lessons. They found that a careful mix of image-caption, interleaved image-text, and text-only data is vital for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks. They also found that the image encoder, image resolution, and image token count have a significant impact, while the vision-language connector design is comparatively less important.
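
As a rough illustration of the pipeline these ablations vary, here is a minimal sketch that wires together a toy image encoder, a vision-language connector, and a small transformer language model: the encoder turns an image into a fixed number of visual tokens, the connector projects them into the language model's embedding space, and the model decodes over the combined image-and-text sequence. All module choices, sizes, and the 224x224 input resolution are illustrative assumptions, not MM1's actual configuration.

```python
import torch
import torch.nn as nn


class ToyMultimodalLM(nn.Module):
    """Illustrative MLLM layout: image encoder -> connector -> language model."""

    def __init__(self, vocab_size=32000, d_model=512, n_image_tokens=16):
        super().__init__()
        # Stand-in image encoder; a pre-trained ViT would sit here in practice.
        self.image_encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 224 * 224, n_image_tokens * d_model),
        )
        # Vision-language connector: projects image features into the LM space.
        self.connector = nn.Linear(d_model, d_model)
        self.n_image_tokens = n_image_tokens
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Stand-in for the (dense or MoE) transformer language model backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, text_ids):
        b = image.shape[0]
        # The image resolution and the number of image tokens both matter;
        # this toy encoder simply emits a fixed number of tokens per image.
        img_tokens = self.image_encoder(image).view(b, self.n_image_tokens, -1)
        img_tokens = self.connector(img_tokens)
        # Prepend image tokens to the text embeddings and decode the sequence.
        seq = torch.cat([img_tokens, self.token_emb(text_ids)], dim=1)
        return self.lm_head(self.backbone(seq))


model = ToyMultimodalLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

Per the findings above, it is the encoder itself, the input resolution, and the number of image tokens that move results the most, while swapping out the connector changes comparatively little.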

The authors used these findings to create MM1, a family of multimodal models with up to 30 billion parameters, including both dense models and mixture-of-experts (MoE) variants. These models are SOTA in pre-training metrics and perform competitively after supervised fine-tuning on various established multimodal benchmarks. Thanks to large-scale pre-training, MM1 exhibits desirable features such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.

Key takeaways:

  • The study focuses on building performant Multimodal Large Language Models (MLLMs) and discusses the importance of various architecture components and data choices.
  • For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art results across multiple benchmarks (a minimal sampling sketch follows this list).
  • The image encoder, image resolution, and image token count have a substantial impact on the model's performance, while the design of the vision-language connector is comparatively less important.
  • The authors built MM1, a family of multimodal models of up to 30B parameters, which are state-of-the-art in pre-training metrics and achieve competitive performance on a range of established multimodal benchmarks.
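
The data-mix takeaway can be made concrete with a small sampling sketch: pre-training examples are drawn from image-caption, interleaved image-text, and text-only sources according to fixed mixing weights. The source contents and weights below are hypothetical placeholders, not the ratios reported in the paper.

```python
import random
from itertools import cycle

# Hypothetical stand-ins for the three kinds of pre-training data; a real
# pipeline would stream from large datasets rather than tiny in-memory lists.
sources = {
    "image_caption": cycle(["<image> a photo of a dog", "<image> a red car"]),
    "interleaved": cycle(["Some text ... <image> ... more text around it"]),
    "text_only": cycle(["A plain text document with no images."]),
}
# Hypothetical mixing weights, not the ratios used for MM1.
weights = {"image_caption": 0.45, "interleaved": 0.45, "text_only": 0.10}


def sample_batch(batch_size=4, seed=0):
    """Draw a batch by picking a source for each example according to its weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[n] for n in names]
    return [
        (name, next(sources[name]))
        for name in rng.choices(names, weights=probs, k=batch_size)
    ]


for source_name, example in sample_batch():
    print(source_name, "->", example)
```

Per the summary above, tuning these proportions carefully is what drives the state-of-the-art few-shot results.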