The library currently supports Google Cloud Vision for OCR, with plans to add Amazon Textract and Microsoft Azure Computer Vision. Basic usage involves importing the library and creating a Tarsier instance with the OCR service of choice. The Tarsier roadmap includes adding documentation and examples, cleaning up interfaces, adding unit tests, improving OCR text performance, and supporting additional browser drivers and OCR services as needed.
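The "instance of Tarsier with the OCR service of choice" pattern can be sketched as a small pluggable-backend design. The class and method names below are illustrative stand-ins, not Tarsier's actual API:

```python
from typing import Protocol


class OCRService(Protocol):
    """Interface an OCR backend would implement (illustrative, not Tarsier's real API)."""
    def annotate(self, screenshot: bytes) -> str: ...


class StubOCRService:
    """Stand-in for a real backend such as a Google Cloud Vision client."""
    def annotate(self, screenshot: bytes) -> str:
        # A real service would return whitespace-structured text from the image.
        return "Example Domain\n\nMore information"


class TarsierLike:
    """Minimal sketch of the pattern: construct with the OCR service of choice."""
    def __init__(self, ocr_service: OCRService) -> None:
        self.ocr_service = ocr_service

    def screenshot_to_text(self, screenshot: bytes) -> str:
        # Delegate OCR to whichever backend was injected at construction time.
        return self.ocr_service.annotate(screenshot)


tarsier = TarsierLike(StubOCRService())
text = tarsier.screenshot_to_text(b"fake-png-bytes")
```

Because the OCR backend is injected rather than hard-coded, adding Amazon Textract or Azure Computer Vision later would only mean writing another class satisfying the same interface.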
Key takeaways:
- Tarsier is an open-source utility library developed by Reworkd for multimodal web agents. It helps map LLM responses back to web elements, mark up a page so an LLM can better understand its action space, and feed a "screenshot" of the page to a text-only LLM.
- Tarsier works by visually "tagging" interactable elements on a page with a bracketed id, and maintains a mapping between ids and elements so that GPT-4(V) can take actions on them.
- Tarsier can also provide a textual representation of the page, and offers OCR utilities to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
- Currently, Tarsier supports the Google Cloud Vision OCR service, with support for Amazon Textract and Microsoft Azure Computer Vision planned.
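The tagging and mapping idea in the takeaways above can be illustrated with a small self-contained sketch. The xpaths, tag ids, and response format here are hypothetical examples, not Tarsier's actual output:

```python
import re

# Hypothetical mapping a tagger might produce: each interactable element on the
# page is labeled on screen with a bracketed id like "[1]", and the library keeps
# a lookup from that id back to the concrete element (here, an xpath).
tag_to_xpath = {
    0: '//input[@name="q"]',
    1: '//button[@type="submit"]',
    2: '//a[@href="/login"]',
}

# The LLM sees the tagged page and answers with an action referencing a tag id.
llm_response = "CLICK [1]"

# Map the response back to the web element the agent should act on.
match = re.search(r"\[(\d+)\]", llm_response)
assert match is not None
element_xpath = tag_to_xpath[int(match.group(1))]
```

The bracketed ids let a text-only model refer to on-screen elements unambiguously, and the reverse lookup is what turns its answer back into a concrete browser action.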