LLaVA-1.5, an improved version of LLaVA, pairs a visual encoder with an open-source chatbot so that it can reason over both images and text, though it still struggles with complex scenes and text recognition. Adept's Fuyu-8B takes aim at unstructured data like charts, graphs, and screens, but ships without built-in moderation mechanisms or prompt-injection guardrails, raising concerns about potential misuse. Despite these limitations, the trend toward open-sourcing multimodal models continues.
Key takeaways:
- OpenAI's GPT-4V is a multimodal model that understands both text and images, but it has been criticized for failing to recognize hate symbols and for discriminating against certain demographics.
- Despite these issues, other companies and independent developers continue to release open-source multimodal models, such as LLaVA-1.5, from a team of researchers at the University of Wisconsin-Madison, Microsoft Research, and Columbia University, and Fuyu-8B, from Adept.
- LLaVA-1.5 shows promise at understanding images and their context, but struggles to recognize text and interpret complex images. Commercial use is also restricted, because part of its training data was generated by ChatGPT; a minimal usage sketch follows this list.
- Adept's Fuyu-8B is designed to understand unstructured data such as charts, diagrams, and software interfaces. However, it lacks built-in moderation mechanisms or prompt-injection guardrails, raising concerns about potential misuse (see the second sketch below).
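For readers who want to experiment, here is a minimal sketch of running LLaVA-1.5 through Hugging Face Transformers. It assumes the community-converted `llava-hf/llava-1.5-7b-hf` checkpoint and LLaVA-1.5's `USER: <image> ... ASSISTANT:` prompt template; the image URL is a placeholder, and a GPU with enough memory for the 7B weights in fp16 is assumed.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Community-converted LLaVA-1.5 checkpoint on the Hugging Face Hub (assumption).
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder image URL; substitute any image you want described.
url = "https://example.com/photo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# LLaVA-1.5 expects a chat-style prompt with an <image> placeholder token.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```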
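Fuyu-8B is likewise available through Transformers. The sketch below follows the general pattern from its Hugging Face model card, assuming the `adept/fuyu-8b` checkpoint, a local `chart.png` as a placeholder input, and a free-form question prompt; as the takeaway above notes, any content moderation has to be layered around the model, since none is built in.

```python
import torch
from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

# Official Fuyu-8B checkpoint released by Adept on the Hugging Face Hub.
model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder input; Fuyu takes charts, diagrams, and UI screenshots directly.
image = Image.open("chart.png")
prompt = "What trend does this chart show?\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
outputs = model.generate(**inputs, max_new_tokens=60)

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```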