In video-related tasks, Gemini Ultra scored 62.7 (4-shot) in English video captioning, outperforming DeepMind Flamingo's 56 (4-shot). For video question answering, Gemini Ultra scored 54.7% (0-shot), higher than SeViLA's 46.3%. In the audio domain, Gemini Pro scored 40.1 in automatic speech translation, well above Whisper v2's 29.1. For automatic speech recognition, Gemini Pro achieved a 7.6% word error rate, which is lower (and therefore better) than Whisper v3's 17.6%.
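To make the size of each gap easier to compare, the figures quoted above can be tabulated in a few lines of Python. This is only an illustrative sketch: the captioning scores are taken as 62.7 and 56 (both 4-shot), the `higher_is_better` flag is our own annotation rather than something stated in the source, and the names `RESULTS`, `margin`, and `summarize` are hypothetical helpers. Word error rate is treated as lower-is-better, so its sign is flipped.

```python
# Scores as quoted in the text; the higher_is_better flag is our own annotation.
# Each row: (task, higher_is_better, Gemini score, baseline name, baseline score)
RESULTS = [
    ("English video captioning (4-shot)", True, 62.7, "DeepMind Flamingo", 56.0),
    ("Video question answering (0-shot)", True, 54.7, "SeViLA", 46.3),
    ("Automatic speech translation", True, 40.1, "Whisper v2", 29.1),
    ("Automatic speech recognition (WER %)", False, 7.6, "Whisper v3", 17.6),
]


def margin(gemini, baseline, higher_is_better):
    """Return the absolute gap in Gemini's favor (positive means Gemini is better)."""
    diff = gemini - baseline
    # For error rates a lower value wins, so flip the sign.
    return diff if higher_is_better else -diff


def summarize(results):
    """Return (task, baseline name, gap rounded to one decimal) for each row."""
    return [
        (task, name, round(margin(gemini, baseline, hib), 1))
        for task, hib, gemini, name, baseline in results
    ]


if __name__ == "__main__":
    for task, name, gap in summarize(RESULTS):
        print(f"{task}: Gemini ahead of {name} by {gap} points")
```

The gaps are absolute differences in each benchmark's own units, so they are not directly comparable across tasks.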
Key takeaways:
- Gemini Ultra (pixel only) outperforms GPT-4V on multi-discipline college-level reasoning problems, natural image understanding, OCR on natural images, document understanding, and infographic understanding.
- In mathematical reasoning in visual contexts, Gemini Ultra (pixel only) also performs better than GPT-4V.
- For English video captioning, Gemini Ultra outperforms DeepMind Flamingo. In video question answering, Gemini Ultra also surpasses SeViLA.
- In the audio category, Gemini Pro outperforms Whisper v2 in automatic speech translation and Whisper v3 in automatic speech recognition.