ScreenAI: A visual language model for UI and visually-situated language understanding

Apr 09, 2024 - research.google
The article introduces ScreenAI, a vision-language model developed for understanding user interfaces (UIs) and infographics. ScreenAI builds on the PaLI architecture and is trained on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify the type and location of UI elements on a screen. The model achieves state-of-the-art results on UI- and infographic-based tasks and best-in-class performance on ChartQA, DocVQA, and InfographicVQA compared to models of similar size.
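
For context, the Screen Annotation target can be thought of as a textual serialization of the UI elements visible on a screenshot. The snippet below is a minimal sketch of one such serialization under an assumed format (element type, normalized bounding box, optional text); it is not ScreenAI's exact schema.

```python
# Minimal sketch of a Screen Annotation-style target string.
# The element types, field order, and coordinate convention are assumptions
# for illustration, not the schema ScreenAI actually predicts.

from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str                        # e.g. "BUTTON", "TEXT", "IMAGE"
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1), normalized to 0-999
    text: str = ""                   # OCR'd or labeled text, if any

def serialize_screen(elements: list[UIElement]) -> str:
    """Turn detected UI elements into a flat annotation string the model can predict."""
    parts = []
    for el in elements:
        x0, y0, x1, y1 = el.bbox
        token = f"{el.kind} {x0} {y0} {x1} {y1}"
        if el.text:
            token += f' "{el.text}"'
        parts.append(token)
    return " | ".join(parts)

if __name__ == "__main__":
    screen = [
        UIElement("IMAGE", (0, 0, 999, 300)),
        UIElement("TEXT", (40, 320, 700, 360), "Welcome back"),
        UIElement("BUTTON", (40, 400, 400, 460), "Sign in"),
    ]
    print(serialize_screen(screen))
    # IMAGE 0 0 999 300 | TEXT 40 320 700 360 "Welcome back" | BUTTON 40 400 400 460 "Sign in"
```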

The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. The pre-training stage relies on self-supervised learning to generate data labels automatically, while the fine-tuning stage uses data manually labeled by human raters. The article also describes how the pre-training dataset was built: a layout annotator identifies and labels UI elements, and an optical character recognition (OCR) engine extracts and annotates the on-screen text.
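
A rough picture of that annotation flow: the layout annotator proposes element boxes, the OCR engine proposes text boxes, and the two are merged into labeled elements without human raters. The sketch below illustrates only the merging step; detect_elements and run_ocr are hypothetical stubs standing in for the real models, and matching words to elements by box overlap is an assumption.

```python
# Hedged sketch of the self-supervised annotation flow: a layout annotator
# proposes UI-element boxes, an OCR engine extracts text boxes, and the two
# are merged into labeled elements. The stubs below stand in for real models.

def detect_elements(screenshot):
    """Stub for the layout annotator; a real pipeline would run a UI-element detector."""
    return [{"kind": "TEXT",   "bbox": (40, 320, 700, 360)},
            {"kind": "BUTTON", "bbox": (40, 400, 400, 460)}]

def run_ocr(screenshot):
    """Stub for the OCR engine; a real pipeline would run OCR on the screenshot."""
    return [{"text": "Welcome back", "bbox": (45, 325, 690, 355)},
            {"text": "Sign",         "bbox": (60, 410, 200, 450)},
            {"text": "in",           "bbox": (210, 410, 260, 450)}]

def overlap_fraction(word_box, element_box):
    """Fraction of the word box's area that lies inside the element box."""
    ix0, iy0 = max(word_box[0], element_box[0]), max(word_box[1], element_box[1])
    ix1, iy1 = min(word_box[2], element_box[2]), min(word_box[3], element_box[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    word_area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    return inter / word_area if word_area else 0.0

def annotate_screen(screenshot):
    """Produce labeled UI elements for one screenshot by merging detector and OCR output."""
    elements = detect_elements(screenshot)
    words = run_ocr(screenshot)
    for el in elements:
        # Attach any OCR text whose box lies mostly inside the element.
        inside = [w["text"] for w in words if overlap_fraction(w["bbox"], el["bbox"]) > 0.5]
        el["text"] = " ".join(inside)
    return elements

if __name__ == "__main__":
    for el in annotate_screen(screenshot=None):
        print(el)
```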

Key takeaways:

  • ScreenAI, a vision-language model for user interfaces and infographics, has been introduced, achieving state-of-the-art results on UI- and infographic-based tasks.
  • ScreenAI is based on the PaLI architecture and uses a flexible patching strategy from pix2struct, allowing it to understand, reason about, and interact with complex and varied UIs and infographics (see the sketch after this list).
  • Three new datasets, Screen Annotation, ScreenQA Short, and Complex ScreenQA, have been released for a comprehensive evaluation of ScreenAI's layout understanding and QA capabilities.
  • Despite achieving competitive performance on a number of public benchmarks, ScreenAI still lags behind larger models, indicating a need for further research.
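
On the second point above, pix2struct-style flexible patching keeps a screenshot's native aspect ratio by choosing a variable rows-by-cols grid of fixed-size patches that fits within a patch budget, rather than resizing every image to one fixed shape. Below is a minimal sketch of that grid computation; the patch size and budget are illustrative defaults, not ScreenAI's actual configuration.

```python
import math

def flexible_patch_grid(height: int, width: int,
                        patch_size: int = 16, max_patches: int = 1024) -> tuple[int, int]:
    """Choose a rows x cols patch grid that preserves the image's aspect ratio
    while staying within the patch budget (the idea behind pix2struct-style patching)."""
    # Scale the image so rows * cols is as large as possible without exceeding max_patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols

if __name__ == "__main__":
    # A tall phone screenshot and a wide desktop screenshot get different grids
    # instead of both being squashed into one fixed shape.
    print(flexible_patch_grid(2400, 1080))   # portrait: more rows than columns
    print(flexible_patch_grid(1080, 1920))   # landscape: more columns than rows
```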