Large Vision Models

Dec 05, 2023 - yutongbai.com
The article introduces a new sequential modeling approach for learning a Large Vision Model (LVM) without the use of any linguistic data. This is achieved by defining a common format, "visual sentences", that can represent raw images, videos, and annotated data sources such as semantic segmentations and depth reconstructions as a single token sequence. The model is trained on a wide variety of visual data to minimize cross-entropy loss for next-token prediction. Scalability is demonstrated across model sizes and levels of data diversity, and the trained model solves many different vision tasks when suitable prompts are designed at test time.
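For concreteness, the training objective can be sketched as follows. This is a minimal, illustrative PyTorch sketch, not the paper's implementation: it assumes a visual tokenizer (the paper describes a VQGAN-style tokenizer) has already mapped each image to discrete token ids, and the sizes (VOCAB_SIZE, TOKENS_PER_IMAGE, D_MODEL) and random stand-in tokens are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Illustrative stand-in sizes, not the paper's actual configuration.
VOCAB_SIZE = 8192       # size of the discrete visual codebook (assumed)
TOKENS_PER_IMAGE = 256  # tokens the visual tokenizer emits per image (assumed)
D_MODEL = 512

class TinyLVM(nn.Module):
    """Decoder-only transformer over discrete visual tokens (a sketch)."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask: each position may only attend to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(x, mask=mask)
        return self.head(x)

# A "visual sentence": the tokens of several related images concatenated into
# one sequence. Random ids stand in for real tokenizer output here.
sentence = torch.randint(0, VOCAB_SIZE, (2, 4 * TOKENS_PER_IMAGE))

model = TinyLVM()
logits = model(sentence[:, :-1])  # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE),  # (batch * seq, vocab)
    sentence[:, 1:].reshape(-1),     # shifted targets
)
loss.backward()
```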

The LVM's scalability across model and data size is demonstrated, with larger LVMs performing better on downstream tasks. Performance is also evaluated on individual sub-components of the training set, showing that the model benefits from every source: single images, videos, and annotations. The article further presents results of the LVM predicting the next frame in video sequences, handling both in-distribution and out-of-distribution prompting examples, and composing multiple tasks within a single prompt. The model is also shown to handle tasks that are not easily described in language, non-verbal IQ tests, and a variety of simple vision tasks.
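All of these test-time behaviors reduce to sequence completion: tokenize the prompt images (example input/output pairs, or preceding video frames, plus a query), then autoregressively generate the tokens of the missing image. Below is a minimal greedy-decoding sketch reusing the TinyLVM, VOCAB_SIZE, and TOKENS_PER_IMAGE stand-ins from the training sketch above; greedy argmax decoding is an assumption here, and the paper's actual sampling scheme may differ.

```python
import torch

@torch.no_grad()
def complete_visual_sentence(model, prompt, n_new=TOKENS_PER_IMAGE):
    """Greedily append n_new predicted tokens to a tokenized prompt.

    `prompt` has shape (1, L), e.g. the concatenated tokens of
    [frame_1, frame_2, frame_3] for next-frame prediction, or
    [img_A, seg_A, img_B, seg_B, img_C] for a segmentation-style prompt.
    """
    tokens = prompt
    for _ in range(n_new):
        logits = model(tokens)                          # (1, L, VOCAB_SIZE)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)   # grow the sequence
    return tokens[:, prompt.size(1):]                   # only generated tokens

# Example: three tokenized video frames as the prompt; the model "answers"
# with the tokens of a predicted fourth frame, decodable back to pixels.
prompt = torch.randint(0, VOCAB_SIZE, (1, 3 * TOKENS_PER_IMAGE))
next_frame_tokens = complete_visual_sentence(TinyLVM(), prompt)
```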

Key takeaways:

  • The researchers introduced a novel sequential modeling approach for training a Large Vision Model (LVM) that does not require any linguistic data. Instead, it uses a common format called "visual sentences" to represent raw images, videos, and annotated data sources.
  • The LVM is trained on a wide variety of visual data (420 billion tokens) to minimize cross-entropy loss for next token prediction. The model has shown effective scalability across various scales of model architecture and data diversity.
  • The LVM can perform various vision tasks when suitable prompts are designed at test time. It can predict the next frame in a video, handle out-of-distribution prompting examples, and compose several tasks within a single prompt.
  • The LVM can also handle tasks that are not easily described in language, such as non-verbal IQ tests, as well as a variety of simple vision tasks including object replication, relighting, and zooming in.