Paper page - Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

The article introduces Ferret-v2, an upgrade to Ferret, a multimodal large language model (MLLM) for referring and grounding, designed to overcome its predecessor's limitations. The new model features three key designs: any-resolution grounding and referring, a flexible approach that lets the model process and understand high-resolution images; multi-granularity visual encoding, which integrates an additional DINOv2 encoder so the model learns better and more diverse underlying contexts for global and fine-grained visual information; and a three-stage training paradigm that adds a high-resolution dense alignment stage before the final instruction tuning.

Experimental results show that Ferret-v2 substantially outperforms Ferret and other state-of-the-art methods. The authors attribute the improvements to its high-resolution scaling and fine-grained visual processing.

Key takeaways:

Ferret-v2 is a significant upgrade to Ferret, designed to overcome its limitations and improve its performance on broader tasks.
The new model introduces a flexible approach to any-resolution grounding and referring, which lets it handle higher image resolutions and understand images in greater detail.
Ferret-v2 uses multi-granularity visual encoding: it integrates an additional DINOv2 encoder, which helps the model learn better and more diverse underlying contexts for global and fine-grained visual information.
A three-stage training paradigm adds a stage for high-resolution dense alignment before the final instruction tuning. With these changes, Ferret-v2 shows substantial improvements over Ferret and other state-of-the-art methods.
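
As a rough illustration of the any-resolution idea in the takeaways above, the sketch below splits a high-resolution image into a grid of fixed-size tiles and also keeps a low-resolution global view. It is not the authors' implementation: the function names, the 336-pixel tile size, the limit of four tiles, and the nearest-neighbor resize are all assumptions chosen to keep the example short.

```python
import numpy as np


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) array using index lookup."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows[:, None], cols]


def choose_grid(h, w, max_tiles=4):
    """Pick the (rows, cols) grid whose aspect ratio best matches the image.

    Ties are broken in favor of the grid with more tiles (hypothetical rule).
    """
    candidates = [
        (r, c)
        for r in range(1, max_tiles + 1)
        for c in range(1, max_tiles + 1)
        if r * c <= max_tiles
    ]
    target = np.log(w / h)
    return min(
        candidates,
        key=lambda rc: (abs(np.log(rc[1] / rc[0]) - target), -rc[0] * rc[1]),
    )


def split_any_resolution(image, tile=336, max_tiles=4):
    """Return a global low-resolution view, a list of local tiles, and the grid.

    `tile` and `max_tiles` are assumed values, not taken from the paper.
    """
    h, w = image.shape[:2]
    rows, cols = choose_grid(h, w, max_tiles)
    global_view = resize_nearest(image, tile, tile)
    upscaled = resize_nearest(image, rows * tile, cols * tile)
    tiles = [
        upscaled[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
        for r in range(rows)
        for c in range(cols)
    ]
    return global_view, tiles, (rows, cols)


if __name__ == "__main__":
    wide = np.zeros((600, 1200, 3), dtype=np.uint8)
    global_view, tiles, grid = split_any_resolution(wide)
    print(grid, len(tiles), global_view.shape, tiles[0].shape)
```

In a multi-granularity setup, the global view and the local tiles could then be routed to different visual encoders, with the tile features stitched back together according to the grid.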
