To improve Ferret-UI's reasoning ability, the authors compiled a dataset for more advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training, Ferret-UI demonstrated strong comprehension of UI screens and the ability to execute open-ended instructions. The model was evaluated on a comprehensive benchmark covering all the tasks it was trained on, where it outperformed most open-source UI MLLMs and even surpassed GPT-4V on all the elementary UI tasks.
Key takeaways:
- The paper introduces Ferret-UI, a multimodal large language model (MLLM) specifically designed for understanding and interacting with user interface (UI) screens.
- Ferret-UI uses 'any resolution' to enhance the visual features of UI screens: each screen is divided into sub-images that are encoded separately before being sent to the language model (see the sketch after this list).
- The model is trained on a wide range of UI tasks, with samples formatted for instruction-following and annotated with regions to facilitate precise referring and grounding (a hypothetical sample format is sketched after this list).
- Ferret-UI outperforms most open-source UI MLLMs and even surpasses GPT-4V on all elementary UI tasks, according to a comprehensive benchmark.
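As a rough illustration of the 'any resolution' idea, the sketch below splits a screenshot into a portrait or landscape grid, encodes each view separately, and concatenates the resulting visual tokens. The 1x2 / 2x1 grid choice and the `encoder` callable are stand-ins, not the authors' actual configuration.

```python
# Minimal sketch of the "any resolution" scheme described above; the grid
# layout and the `encoder` stand-in are assumptions, not the paper's code.
from PIL import Image


def split_into_subimages(img: Image.Image) -> list[Image.Image]:
    """Split a UI screenshot into two halves along its longer side,
    so sub-images stay closer to the encoder's native aspect ratio."""
    w, h = img.size
    if h >= w:  # portrait phone screen -> top/bottom halves
        boxes = [(0, 0, w, h // 2), (0, h // 2, w, h)]
    else:       # landscape screen -> left/right halves
        boxes = [(0, 0, w // 2, h), (w // 2, 0, w, h)]
    return [img.crop(box) for box in boxes]


def encode_screen(img: Image.Image, encoder) -> list:
    """Encode the full screenshot plus each sub-image separately, then
    concatenate all visual tokens before handing them to the language model.
    `encoder` is a placeholder for any image encoder returning a token list."""
    views = [img] + split_into_subimages(img)
    token_sequences = [encoder(view) for view in views]
    return [tok for seq in token_sequences for tok in seq]
```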
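The instruction-following samples with region annotations could look roughly like the following. The prompt template, field names, and normalized-coordinate convention are illustrative assumptions rather than the paper's exact schema.

```python
# Hypothetical formatting of one grounding sample; everything here (field
# names, wording, normalization) is an assumption for illustration only.
def format_grounding_sample(screen_id: str, widget_text: str,
                            box: tuple[int, int, int, int],
                            image_size: tuple[int, int]) -> dict:
    """Turn one labeled widget into an instruction-following pair whose
    answer contains the region as normalized [x1, y1, x2, y2] coordinates."""
    w, h = image_size
    x1, y1, x2, y2 = box
    norm = [round(x1 / w, 3), round(y1 / h, 3), round(x2 / w, 3), round(y2 / h, 3)]
    return {
        "image": f"{screen_id}.png",
        "instruction": f'Where is the button labeled "{widget_text}" on the screen?',
        "response": f'The "{widget_text}" button is located at {norm}.',
    }


# Example usage with a made-up widget and screen size.
sample = format_grounding_sample("screen_0001", "Sign in",
                                 box=(120, 840, 600, 930),
                                 image_size=(750, 1334))
```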