Experimental results show that Ferret-v2 substantially outperforms Ferret and other state-of-the-art methods. The authors attribute the improvements to its high-resolution scaling and fine-grained visual processing.
Key takeaways:
- Ferret-v2 is a significant upgrade to Ferret, designed to overcome its limitations and improve its performance on broader tasks.
- The new model introduces a flexible approach to any-resolution grounding and referring, which allows it to handle higher image resolutions and understand images in greater detail.
- Ferret-v2 incorporates multi-granularity visual encoding by integrating an additional DINOv2 encoder, which helps the model learn better and more diverse underlying contexts for global and fine-grained visual information.
- A three-stage training paradigm is proposed, which includes an additional stage for high-resolution dense alignment before the final instruction tuning, leading to substantial improvements over Ferret and other state-of-the-art methods.
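
To make the any-resolution and multi-granularity ideas above more concrete, here is a minimal sketch of how a high-resolution image might be planned into one low-resolution global view plus a grid of local tiles. It is an illustration under stated assumptions, not the authors' implementation: the tile size of 336, the candidate grids, the aspect-ratio selection rule, and the function names `choose_grid`, `tile_boxes`, and `plan_views` are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Grid = Tuple[int, int]           # (columns, rows)


def choose_grid(width: int, height: int, candidates: Sequence[Grid]) -> Grid:
    """Pick the candidate grid whose aspect ratio is closest to the image's.

    Assumption: the summary does not say how a grid is chosen; matching the
    aspect ratio is just one simple, plausible rule.
    """
    image_ratio = width / height
    return min(candidates, key=lambda g: abs(g[0] / g[1] - image_ratio))


def tile_boxes(grid: Grid, tile: int) -> List[Box]:
    """Row-major crop boxes for an image resized to (cols*tile, rows*tile)."""
    cols, rows = grid
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


def plan_views(
    width: int,
    height: int,
    tile: int = 336,  # hypothetical encoder input size
    candidates: Sequence[Grid] = ((1, 1), (1, 2), (2, 1), (2, 2)),
) -> Dict[str, object]:
    """Describe the global view and local tiles for one image."""
    grid = choose_grid(width, height, candidates)
    return {
        # Whole image downsampled to a single tile: coarse, global context.
        "global_size": (tile, tile),
        # Image resized to fill the chosen grid before cropping tiles.
        "resized_size": (grid[0] * tile, grid[1] * tile),
        # Local crops: fine-grained detail at higher effective resolution.
        "local_boxes": tile_boxes(grid, tile),
    }


plan = plan_views(1344, 672)
print(plan["resized_size"], len(plan["local_boxes"]))
```

In a pipeline of this shape, the global view and the local tiles could be sent to different visual encoders (for example, the tiles to the additional DINOv2 encoder), and their features merged before being passed to the language model. That routing is one reading of the summary rather than a detail it spells out.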