There has been some controversy over the benchmarking of these models. Meta claimed that its Maverick model outperformed OpenAI's GPT-4o, but it emerged that the version submitted for testing was a customized variant optimized for conversationality rather than the publicly released model. This prompted LMArena, the benchmarking platform, to criticize Meta for a lack of transparency. Even so, the experimental Maverick model currently ranks second on LMArena, tied with GPT-4o and Grok 3, while Google's Gemini 2.5 Pro holds the top spot. More details about the Llama 4 family, including additional models, are expected at LlamaCon, Meta's upcoming AI developer conference.
Key takeaways:
- Meta has released two new AI models, Maverick and Scout, as part of its Llama 4 family, which are open-weights and multimodal.
- There is controversy over Meta's benchmarking practices, as the model submitted to LMArena was optimized for conversationality, potentially skewing results.
- Scout is the smaller model, with 17 billion active parameters and 16 experts, while Maverick is a midsized model with 17 billion active parameters and 128 experts; both use a mixture-of-experts design (see the sketch after this list).
- Meta's experimental Llama 4 Maverick model currently ranks second on LMArena, tied with GPT-4o and Grok 3, while Google's Gemini 2.5 Pro holds the top spot.
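
The "active parameters vs. experts" distinction reflects a mixture-of-experts (MoE) design: the model stores many expert sub-networks, but each token is routed through only a few of them, so the parameters active per token are a fraction of the total stored. The PyTorch layer below is a minimal sketch of that idea only; the class name, dimensions, and top-1 routing are illustrative assumptions and do not reflect Llama 4's actual implementation.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts layer (illustrative only, not Meta's code).

    A router scores the experts for each token, and only the top-k experts
    process that token, so "active" parameters per token stay small even
    though the layer stores parameters for every expert.
    """

    def __init__(self, d_model=64, n_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        weights = self.router(x).softmax(dim=-1)       # routing probabilities
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, k:k+1] * expert(x[mask])
        return out

layer = ToyMoELayer(n_experts=16)
y = layer(torch.randn(8, 64))                          # 8 tokens flow through the layer

# Total parameters count every expert; active parameters count the router
# plus the single expert each token actually uses (top_k=1 here).
total = sum(p.numel() for p in layer.parameters())
active = sum(p.numel() for p in layer.router.parameters()) + \
         sum(p.numel() for p in layer.experts[0].parameters())
print(f"total params: {total:,}  active per token: {active:,}")
```

At a larger scale, the same accounting is why Scout and Maverick can share the same 17-billion active-parameter figure while differing greatly in total size: adding experts grows the stored parameter count without changing how many parameters any single token touches.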