AI & ML · Advanced · By Samson Tanimawo, PhD · Published Jun 2, 2026 · 5 min read

Multimodal Models: Vision, Audio, Video

By 2026 most frontier models are multimodal by default. Text-only is a special case, not the norm. Here is what the current crop can and cannot do.

How modalities tokenise

Each modality is converted to a sequence of tokens that share an embedding space with text. Images: a ViT-style vision encoder produces patch tokens. Audio: a wav2vec-style encoder produces audio tokens. Video: per-frame patch tokens plus temporal position tokens.

The unified embedding space is what lets the model reason across modalities. Once everything is tokens, the same transformer attends across images, sound, and text.
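The image half of this pipeline can be sketched in a few lines. This is an illustrative toy, not any specific model: the patch size, embedding width, and random projection standing in for the trained encoder are all made-up assumptions.

```python
import numpy as np

# Illustrative dimensions only -- not any particular model's configuration.
D_MODEL = 512          # shared embedding width for all modalities
PATCH = 16             # patch side length in pixels

rng = np.random.default_rng(0)

def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)          # (nh, nw, p, p, c)
    return patches.reshape(-1, patch * patch * c)       # (n_patches, p*p*c)

# A random matrix stands in for the trained vision encoder's projection.
W_proj = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))

image = rng.random((224, 224, 3))
image_tokens = patchify(image) @ W_proj                 # (196, 512)

# Text tokens live in the same 512-dim space, so one transformer can
# attend across both once the sequences are concatenated.
text_tokens = rng.normal(size=(20, D_MODEL))
sequence = np.concatenate([image_tokens, text_tokens])  # (216, 512)
print(sequence.shape)
```

A 224×224 image at 16-pixel patches yields 14 × 14 = 196 image tokens; after the projection they are indistinguishable, shape-wise, from text tokens, which is the whole trick.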

Vision

Strong at reading documents (text, tables, charts), interpreting diagrams, identifying objects, and OCR. The 2024-2025 frontier multimodal models match dedicated vision systems on most of these tasks.

Audio

Transcription is solved. Voice cloning is uncomfortably good. Audio question answering (“what genre is this?”) is improving. Real-time voice conversation is in production with Gemini, GPT, and Claude.

Video

Newer and harder. Long videos force sparse frame sampling, because dense per-frame analysis blows the token budget. Current models are strong at short-clip understanding (TikTok-length) and weaker on hour-long material.
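The sparse-sampling constraint is just arithmetic. Here is a back-of-the-envelope sketch; the tokens-per-frame figure and context budget are illustrative assumptions, not any specific model's settings.

```python
# Illustrative numbers only: patch grid and context size are assumptions.
TOKENS_PER_FRAME = 256        # e.g. a 16x16 patch grid per frame
CONTEXT_BUDGET = 128_000      # assumed context window in tokens

def frames_that_fit(budget: int = CONTEXT_BUDGET,
                    tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """How many frames fit in the context window at all."""
    return budget // tokens_per_frame

def max_sampling_fps(video_seconds: float) -> float:
    """Highest frame rate that keeps the whole video inside the budget."""
    return frames_that_fit() / video_seconds

# A 30-second clip can be sampled densely...
print(max_sampling_fps(30))      # ~16.7 fps: near-native coverage
# ...but an hour of video forces sparse sampling.
print(max_sampling_fps(3600))    # ~0.14 fps: one frame every ~7 seconds
```

Under these assumed numbers only 500 frames fit at all, which is why a short clip gets near-native frame coverage while an hour-long video is seen as one frame every several seconds.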

Where they fail