The author invites collaboration and outlines several ideas for the project's development: using the Assistants API for automatic context retrieval, creating a dedicated fork of Vimium for overlaying element labels, sending higher-resolution screenshots, fine-tuning LLaVA, using JSON mode with the Vision API, having the Vision API return general instructions that a second JSON-mode call formalizes, adding speech-to-text with Whisper or another model for accessibility, and making the project work in the user's own browser instead of a newly launched automated one.
Key takeaways:
- The vimGPT project explores GPT-4V's vision capabilities for web browsing, using the Vimium Chrome extension as a keyboard-driven interface through which the model can act on the page (a sketch of this loop follows the list).
- Several ideas for future development are proposed, including using the Assistants API for automatic context retrieval, a dedicated Vimium fork for overlaying element labels, higher-resolution screenshots, and fine-tuning LLaVA.
- Other suggestions include using JSON mode with the Vision API, having the Vision API return general instructions that are then formalized by a second call to a JSON-mode model (sketched below), adding speech-to-text with Whisper or another model for accessibility (also sketched below), and making the project work in the user's own browser instead of a newly launched automated one.
- The project references two other GitHub repositories, Globe-Engineer/globot and nat/natbot, as resources.
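To make the Vimium-based interface concrete, here is a minimal sketch of the kind of perceive-act loop such a setup implies. It assumes Playwright driving a Chromium profile that already has Vimium installed; the helper `ask_gpt4v`, the prompt wording, and the hint-typing convention are illustrative assumptions, not vimGPT's actual code. The `"detail": "high"` parameter is the Vision API's knob for the higher-resolution idea mentioned above.

```python
# Hypothetical sketch of a Vimium-driven browsing loop; the helper name and
# action format are illustrative assumptions, not vimGPT's actual code.
import base64
from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI()

def ask_gpt4v(objective: str, screenshot: bytes) -> str:
    """Send the current screenshot to GPT-4V and get back the next action."""
    image_b64 = base64.b64encode(screenshot).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Objective: {objective}. "
                 "Vimium link hints are visible. Reply with only the hint "
                 "letters of the element to click."},
                # "detail": "high" requests a higher-resolution pass over the
                # image -- the "higher resolution images" idea from the list.
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_b64}",
                    "detail": "high"}},
            ],
        }],
        max_tokens=50,
    )
    return response.choices[0].message.content

with sync_playwright() as p:
    # Assumes a persistent profile with the Vimium extension installed.
    browser = p.chromium.launch_persistent_context("profile/", headless=False)
    page = browser.pages[0]
    page.goto("https://news.ycombinator.com")
    page.keyboard.press("f")           # Vimium: overlay link hints
    action = ask_gpt4v("Open the top story", page.screenshot())
    page.keyboard.type(action)         # type the hint letters GPT-4V chose
```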
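The two-call suggestion reads as a workaround for gpt-4-vision-preview not supporting JSON mode at launch: let the vision model answer in free-form prose, then have a JSON-mode text model convert that prose into a strict action object. A minimal sketch, where the action schema is my own assumed example rather than one defined by the project:

```python
# Sketch of the two-step "formalize with JSON mode" idea; the action schema
# below is an assumed example, not a schema defined by vimGPT.
import json
from openai import OpenAI

client = OpenAI()

def formalize(instruction: str) -> dict:
    """Turn a free-form instruction from the vision model into strict JSON."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        # JSON mode guarantees the reply parses as a single JSON object.
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content":
             "Convert the browsing instruction into JSON with keys "
             '"action" ("click" | "type" | "navigate") and "target".'},
            {"role": "user", "content": instruction},
        ],
    )
    return json.loads(response.choices[0].message.content)

# e.g. formalize("Click the search box and type 'vimGPT'")
# might return {"action": "type", "target": "search box"}
```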
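For the accessibility suggestion, OpenAI's hosted Whisper endpoint is one way a spoken objective could feed the browsing loop (a locally run open-source Whisper model would work as well). A minimal sketch:

```python
# Sketch: transcribe a spoken objective with OpenAI's hosted Whisper model.
from openai import OpenAI

client = OpenAI()

with open("objective.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
    )

# The transcribed text would become the objective passed to the loop above.
print(transcript.text)
```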