Multimodal Models: Vision, Audio, Video
By 2026, most frontier models are multimodal by default; text-only is the special case, not the norm. Here is what the current crop can and cannot do.
How modalities tokenise
Each modality is converted to a sequence of tokens that share an embedding space with text. Images: a ViT-style vision encoder splits the image into fixed-size patches and emits one token per patch. Audio: a wav2vec-style encoder turns the waveform into audio tokens. Video: per-frame patch tokens plus temporal position tokens.
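As a rough sketch of the image path, the patch step can be written in a few lines of NumPy. This is illustrative only: real encoders follow the patchify with a learned linear projection and position embeddings, and the 16-pixel patch size is just the common ViT default.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch vectors, the raw
    input a ViT-style encoder projects into token embeddings."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    # Reshape into a grid of patches, then flatten each patch into one vector.
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

tokens = patchify(np.zeros((224, 224, 3), dtype=np.float32))
print(tokens.shape)  # (196, 768): a 14x14 grid of 16x16x3 patches
```

A 224x224 image becomes 196 tokens, which is why high-resolution images are expensive: token count grows with the square of resolution.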
The unified embedding space is what lets the model reason across modalities. Once everything is tokens, the same transformer attends across images, sound, and text.
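The "everything is tokens" point can be made concrete with a toy sketch. The encoders here are stand-ins (random vectors, not real models), and the token counts and `d_model` width are made-up; the point is only that once each modality is projected to the shared width, the result is a single sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # shared embedding width (illustrative, not a real model's size)

# Stand-ins for per-modality encoders, each emitting vectors in the same space.
text_tokens  = rng.normal(size=(12, d_model))   # 12 text tokens
image_tokens = rng.normal(size=(196, d_model))  # 196 image patch tokens
audio_tokens = rng.normal(size=(50, d_model))   # 50 audio frame tokens

# After embedding, the transformer sees one flat sequence; self-attention
# treats a patch token and a text token identically.
sequence = np.concatenate([text_tokens, image_tokens, audio_tokens], axis=0)
print(sequence.shape)  # (258, 64)
```

Cross-modal reasoning falls out of this layout for free: an attention head can attend from a text position to an image position with no special machinery.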
Vision
Strong at reading documents (text, tables, charts), interpreting diagrams, identifying objects, and OCR. The 2024-2025 frontier multimodal models match dedicated vision systems on most of these tasks.
Audio
Transcription is effectively solved. Voice cloning is uncomfortably good. Audio question answering (“what genre is this?”) is improving. Real-time voice conversation is in production across Gemini, GPT, and Claude.
Video
Newer and harder. Long videos force sparse frame sampling, because dense per-frame analysis is too expensive. Current models are strong at short-clip understanding (TikTok-length) and weaker on hour-long material.
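The sparse-sampling trade-off is easy to see in numbers. A minimal sketch, assuming the simplest possible strategy (uniform sampling under a fixed frame budget; the 64-frame budget is an assumption, not any particular model's limit):

```python
import numpy as np

def sample_frame_indices(n_frames: int, budget: int) -> np.ndarray:
    """Uniformly pick `budget` frame indices from a video of n_frames.
    Illustrative: under a fixed budget, longer videos get sparser coverage."""
    return np.linspace(0, n_frames - 1, num=budget, dtype=int)

# A 1-hour video at 30 fps under a 64-frame budget:
idx = sample_frame_indices(n_frames=60 * 60 * 30, budget=64)
print(len(idx), idx[1] - idx[0])  # 64 frames, ~57 seconds between samples
```

At that spacing, anything that happens between samples is simply invisible to the model, which is one reason long-video reasoning degrades so sharply.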
Where they fail
- Pixel-precise tasks (counting objects, exact measurement) still lag dedicated vision systems.
- Long video reasoning still drops sharply past a few minutes.
- Generated images and video remain detectable as synthetic, and quality varies.