The study highlights that these AI models are not useless, but their abilities are narrower than their marketing suggests. They are likely to be highly accurate at interpreting human actions, expressions, and everyday objects, yet their grasp of visual information is abstract and approximate. The researchers argue that these models do not have "visual understanding" in the human sense; instead, they are informed about an image without actually seeing it.
Key takeaways:
- The latest language models like GPT-4o and Gemini 1.5 Pro, which are marketed as "multimodal" and capable of understanding images, audio, and text, may not actually "see" in the way humans do.
- A study by researchers at Auburn University and the University of Alberta found that these models struggle with simple visual tasks that a human child could easily accomplish, such as identifying whether two shapes overlap or counting the number of shapes in an image.
- The researchers suggest that these models don't actually have visual understanding; instead, they match patterns in the input against patterns seen in their training data.
- Despite their limitations, these models are still highly accurate at interpreting human actions, expressions, and everyday objects and situations, which is what they are primarily intended for.
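To underline how simple the failed tasks are computationally, consider the shape-overlap question mentioned above. For two circles it reduces to a single distance comparison; the function below is an illustrative sketch (its name and parameters are my own, not taken from the study).

```python
import math

def circles_overlap(center1, r1, center2, r2):
    """Return True if two circles overlap or touch:
    the distance between centers is at most the sum of radii."""
    dx = center1[0] - center2[0]
    dy = center1[1] - center2[1]
    return math.hypot(dx, dy) <= r1 + r2

# Centers 3 apart, radii sum to 4 -> overlapping
print(circles_overlap((0, 0), 2, (3, 0), 2))   # True
# Centers 10 apart, radii sum to 4 -> disjoint
print(circles_overlap((0, 0), 2, (10, 0), 2))  # False
```

A human child, or the few lines above, answers this reliably; the point of the study is that models marketed as "seeing" images do not.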