
By Teaching AI to Make Pictures and Write, Scientists Improve Its Grasp of Vision and Language

Sep 22, 2023 - notes.aimodels.fyi
Researchers from Anthropic, Tsinghua University, Xi'an Jiaotong University, and MEGVII Technology have developed a novel framework called DREAMLLM for training multimodal large language models (MLLMs) that can understand and generate both images and text. The model uses diffusion models for image generation and score distillation for training, and it introduces "dream queries" for extracting multimodal semantics. It achieved state-of-the-art results on common multimodal benchmarks, significantly outperforming other MLLMs.
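As a rough intuition for score distillation (the weighting and names below are illustrative assumptions, not the paper's exact formulation), the training signal can be sketched as a weighted difference between the noise a frozen diffusion model predicts and the noise that was actually added:

```python
import numpy as np

def sds_style_gradient(eps_pred, eps_true, weight):
    """Score-distillation-style gradient: push the generator so that
    a frozen diffusion model's noise prediction matches the noise
    actually injected. (Illustrative form only.)"""
    return weight * (eps_pred - eps_true)

rng = np.random.default_rng(0)
eps_true = rng.normal(size=(4, 4))   # noise added at some diffusion timestep
eps_pred = rng.normal(size=(4, 4))   # frozen diffusion model's prediction
grad = sds_style_gradient(eps_pred, eps_true, weight=0.5)
print(grad.shape)  # (4, 4)
```

The key point is that the diffusion model itself is not updated; its prediction error is propagated back as a learning signal for the generator.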

The development of DREAMLLM brings us closer to AI assistants that can understand and generate both visual and textual information. It learns real-world patterns of interleaving text and images, preserves visual details by modeling images as pixels, and avoids bottlenecks by not forcing the model to match other image representations. Despite concerns around bias, safety, and misuse of generative models, DREAMLLM's capabilities hint at future applications in quickly generating customized visual content.

Key takeaways:

  • The authors propose DREAMLLM, a novel framework for training multimodal large language models (MLLMs) that can both understand and generate images and text.
  • By training on free-form documents, DREAMLLM learns real-world patterns of interleaving text and images, developing a joint understanding of vision and language.
  • Modeling images as pixels instead of discrete tokens preserves visual details, and the dream queries act as an interpreter between modalities.
  • Strong zero-shot performance shows the model develops a robust general intelligence spanning both images and text, hinting at future applications in quickly generating customized visual content.
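The "interpreter" role of dream queries can be sketched as cross-attention: learnable query vectors read multimodal semantics out of the language model's hidden states, producing conditioning vectors for the image decoder. All names, shapes, and the single-head attention below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dream_query_attention(queries, hidden_states):
    """Cross-attention sketch: 'dream queries' (Q, d) attend over the
    MLLM's hidden states (T, d) and return (Q, d) conditioning vectors
    that a diffusion decoder could consume."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    attn = softmax(queries @ hidden_states.T * scale, axis=-1)
    return attn @ hidden_states

rng = np.random.default_rng(0)
d = 64
dream_queries = rng.normal(size=(8, d))   # hypothetical: 8 learnable queries
hidden = rng.normal(size=(32, d))         # hidden states for one sequence
cond = dream_query_attention(dream_queries, hidden)
print(cond.shape)  # (8, 64)
```

Because the queries are learned rather than hand-specified, the model is free to decide what semantics to pass between the language and image sides.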
