The article also discusses the training framework, system optimizations, and model architecture in detail. To make efficient use of compute, the team used a simple caching mechanism and applied optimizations on the MPS/MLX front. The model was trained by precomputing the image embeddings via SigLIP and then learning a projection matrix. After pretraining, supervised finetuning was performed to further improve the model's performance. The article concludes by highlighting that Llama3-V offers vision capabilities comparable to models nearly 100x its size, such as GPT4v, Gemini Ultra, and Claude Opus.
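The article describes this pipeline only at a high level; the sketch below shows one way the precompute-and-cache step could look, assuming a Hugging Face SigLIP checkpoint. The checkpoint name, cache layout, and `cached_image_embedding` helper are illustrative assumptions, not the authors' implementation.

```python
import hashlib
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoProcessor, SiglipVisionModel

# Illustrative checkpoint; the article uses SigLIP but does not pin a variant here.
CKPT = "google/siglip-so400m-patch14-384"
CACHE_DIR = Path("siglip_cache")
CACHE_DIR.mkdir(exist_ok=True)

processor = AutoProcessor.from_pretrained(CKPT)
vision_tower = SiglipVisionModel.from_pretrained(CKPT).eval()


@torch.no_grad()
def cached_image_embedding(image_path: str) -> torch.Tensor:
    """Return SigLIP patch embeddings for an image, caching them on disk."""
    key = hashlib.sha1(image_path.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.pt"
    if cache_file.exists():
        # Embedding was already computed in an earlier pass; reuse it.
        return torch.load(cache_file)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # last_hidden_state: (1, num_patches, hidden_dim) patch-level embeddings
    embedding = vision_tower(pixel_values=pixel_values).last_hidden_state.squeeze(0)
    torch.save(embedding, cache_file)
    return embedding
```

Because the SigLIP encoder's outputs stay fixed while the projection matrix is being learned, each image only needs to be encoded once, which is what makes such a simple cache worthwhile.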
Key takeaways:
- The article introduces Llama3-V, the first-ever multimodal model built on top of Llama3, which outperforms Llava, the current state-of-the-art model for multimodal understanding, by 10-20%.
- Llama3-V uses SigLIP for image embedding and aligns textual and visual tokens using a projection block with two self-attention blocks (a sketch of such a block follows this list).
- The training process involves precomputing the image embeddings via SigLIP, learning a projection matrix, and then performing supervised finetuning.
- The model offers vision capabilities comparable to models nearly 100x its size, such as GPT4v, Gemini Ultra, and Claude Opus, and can be trained for under $500.
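As referenced above, the alignment step can be pictured with a minimal sketch: two self-attention (Transformer encoder) blocks refine the SigLIP patch embeddings, and a linear layer maps them into Llama3's token-embedding space. The dimensions, patch count, and layer choices below are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn


class ProjectionBlock(nn.Module):
    """Maps SigLIP patch embeddings into the language model's embedding space.

    Two self-attention (Transformer encoder) blocks refine the visual tokens,
    then a linear layer projects them to the LLM hidden size. The default
    sizes below are illustrative, not the published configuration.
    """

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=vision_dim,
            nhead=num_heads,
            dim_feedforward=4 * vision_dim,
            batch_first=True,
        )
        # Two self-attention blocks over the visual tokens.
        self.self_attention = nn.TransformerEncoder(layer, num_layers=2)
        # Linear projection into the textual embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # image_embeddings: (batch, num_patches, vision_dim)
        x = self.self_attention(image_embeddings)
        return self.proj(x)  # (batch, num_patches, llm_dim)


# Example: project a batch of two images' SigLIP embeddings (shapes assumed).
projector = ProjectionBlock()
visual_tokens = projector(torch.randn(2, 729, 1152))  # -> (2, 729, 4096)
```

In a setup like this, the projected visual tokens would be concatenated with the text token embeddings before being passed to the language model; during the pretraining stage only the projection block would be updated, with supervised finetuning following afterwards, in line with the training process described above.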