Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

The article introduces Ferret-UI, a multimodal large language model (MLLM) designed specifically for understanding and interacting with user interface (UI) screens. Unlike general-domain MLLMs, Ferret-UI is equipped with referring, grounding, and reasoning capabilities, and it uses an "any resolution" feature to magnify the fine details of UI screens, which are typically more elongated than natural images and contain small objects such as icons and text. Each screen is divided into two sub-images based on its original aspect ratio (a horizontal division for portrait screens, a vertical division for landscape screens), and the sub-images are encoded separately before being sent to the language model. The model is trained on a wide range of basic UI tasks, such as icon recognition and text finding, with the training samples formatted for instruction-following and annotated with regions to support precise referring and grounding.
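The sub-image split described above can be illustrated with a minimal sketch. It assumes the division follows screen orientation (portrait screens are cut into top and bottom halves, landscape screens into left and right halves) and that the halves are equal; the function name `split_screen` and the pixel-box representation are hypothetical and are not taken from the Ferret-UI code.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; a hypothetical representation
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return two crop boxes that together cover the whole screen.

    Assumption: portrait screens are split into top/bottom halves and
    landscape screens into left/right halves, at the midpoint.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        # Portrait (or square): horizontal cut at mid-height
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    # Landscape: vertical cut at mid-width
    mid = width // 2
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    print(split_screen(1170, 2532))  # a portrait phone screenshot
    print(split_screen(2532, 1170))  # the same screen rotated
```

Each returned box could then be cropped out and passed through the image encoder on its own, which is the step the summary refers to as encoding the sub-images separately.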

To improve Ferret-UI's reasoning ability, the authors also compiled a dataset of more advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on these datasets, Ferret-UI demonstrated strong comprehension of UI screens and the ability to execute open-ended instructions. The model was evaluated on a comprehensive benchmark covering all the tasks it was trained on. Ferret-UI outperformed most open-source UI MLLMs and surpassed GPT-4V on all the basic UI tasks.

Key takeaways:

The paper introduces Ferret-UI, a multimodal large language model (MLLM) specifically designed for understanding and interacting with user interface (UI) screens.
Ferret-UI uses "any resolution" to enhance the visual features of UI screens: each screen is divided into two sub-images, which are encoded separately before being sent to the language model.
The model is trained on a wide range of UI tasks, with samples formatted for instruction-following and region annotations to facilitate precise referring and grounding.
Ferret-UI outperforms most open-source UI MLLMs and even surpasses GPT-4V on all elementary UI tasks, according to a comprehensive benchmark.
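To make the idea of instruction-formatted samples with region annotations more concrete, here is a rough sketch of what a single referring sample might look like. Everything in it is an assumption for illustration only: the field names, the prompt wording, and the choice to rescale pixel boxes onto a 0 to 1000 grid are hypothetical and do not reproduce the paper's actual data format.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def to_grid(box: Box, width: int, height: int, grid: int = 1000) -> Box:
    """Rescale a pixel box to a resolution-independent grid (assumed 0..1000)."""
    left, top, right, bottom = box
    return (
        left * grid // width,
        top * grid // height,
        right * grid // width,
        bottom * grid // height,
    )


def make_referring_sample(
    box: Box, width: int, height: int, answer: str
) -> Dict[str, str]:
    """Build one hypothetical instruction-following sample for a referring task."""
    region = list(to_grid(box, width, height))
    return {
        # Hypothetical prompt wording; the region is written inline as text
        "instruction": f"What is the UI element in the region {region}?",
        "response": answer,
    }


if __name__ == "__main__":
    sample = make_referring_sample((100, 200, 300, 400), 1000, 2000, "settings icon")
    print(sample["instruction"])
    print(sample["response"])
```

A grounding sample would run in the opposite direction: the instruction names the element and the response carries the region.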
