Multimodal Models: Vision, Audio, Video
By 2026 most frontier models are multimodal by default. Text-only is a special case, not the norm. Here is what the current crop can and cannot do.
How modalities tokenise
Multimodal models handle images, audio, and video by converting them to tokens the transformer can process. Each modality has its own tokenisation strategy. The quality of tokenisation determines how well the model can reason about the modality; clever tokenisation has driven much of the multimodal capability advance.
The image tokenisation. Images are split into fixed-size patches (16×16 pixels is typical); each patch is embedded as one vector token. A 256×256 image therefore yields a 16×16 grid of patches, or 256 tokens. ViT (Vision Transformer) introduced this approach; it dominates modern vision-language models.
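The patch arithmetic is easy to check. A minimal sketch, assuming non-overlapping square patches and ignoring the extra tokens (class token, separators, tiling overhead) that real models often add; the function name is illustrative, not from any library:

```python
def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    """Count ViT-style patch tokens for an image of the given pixel size."""
    if patch <= 0:
        raise ValueError("patch size must be positive")
    if height % patch or width % patch:
        # Real pipelines resize or pad first; this sketch simply refuses.
        raise ValueError("image dimensions must be multiples of the patch size")
    return (height // patch) * (width // patch)


print(patch_token_count(256, 256))    # 16 x 16 grid -> 256
print(patch_token_count(1024, 1024))  # 64 x 64 grid -> 4096
```

Doubling the side length quadruples the token count, which is why production systems usually cap or downscale image resolution.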
The audio tokenisation. Audio is converted to spectrograms or mel features; segments become tokens. Whisper-style approaches pad or trim audio to 30-second windows; each window becomes 3,000 mel frames, which the encoder downsamples to 1,500 positions. Newer approaches use neural audio codecs (EnCodec, SoundStream) that produce discrete tokens directly.
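The window arithmetic for a Whisper-style front end can be sketched as below. The constants (16 kHz sampling, a 160-sample hop, one stride-2 convolution in the encoder) are assumed here to match the published Whisper configuration; other front ends use different values.

```python
def whisper_style_positions(
    seconds: float = 30.0,
    sample_rate: int = 16_000,  # assumed: 16 kHz mono input
    hop_length: int = 160,      # assumed: 10 ms hop between mel frames
    conv_stride: int = 2,       # assumed: encoder halves the frame rate
) -> tuple:
    """Return (mel_frames, encoder_positions) for one audio window."""
    samples = int(seconds * sample_rate)
    mel_frames = samples // hop_length
    encoder_positions = mel_frames // conv_stride
    return mel_frames, encoder_positions


print(whisper_style_positions())  # (3000, 1500)
```

Under these assumptions a 30-second window costs 1,500 encoder positions, or 50 per second of audio.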
The video tokenisation. Video is the hardest modality because it has both temporal and spatial dimensions. The common approach: sample frames, tokenise each as an image, and add temporal position embeddings. Recent work uses spatiotemporal patches that tokenise space-time directly. Token counts are high either way; video consumes context fast.
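For the frame-sampling approach, the token count is roughly frames sampled times patches per frame. A rough sketch, assuming 16×16 patches, no token compression between frames, and no special tokens; real models often pool or merge tokens, so treat the result as an upper-bound estimate:

```python
def video_token_count(
    duration_s: float,
    fps_sampled: float,
    height: int,
    width: int,
    patch: int = 16,
) -> int:
    """Estimate tokens for a clip tokenised frame by frame."""
    if fps_sampled <= 0 or duration_s < 0:
        raise ValueError("duration must be >= 0 and sampling rate > 0")
    frames = int(duration_s * fps_sampled)
    tokens_per_frame = (height // patch) * (width // patch)
    return frames * tokens_per_frame


print(video_token_count(60, 1, 256, 256))  # 60 frames x 256 tokens = 15360
print(video_token_count(60, 2, 448, 448))  # 120 frames x 784 tokens = 94080
```

The sampling rate and the resolution are the two levers: halving either one cuts the bill substantially, at the cost of missed motion or missed detail.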
The unified tokenisation goal. Some research aims at a unified tokeniser handling all modalities. Promise: single model, all modalities, simpler training. Reality: not there yet; modality-specific tokenisers still dominate. Convergence by 2027-2028 is plausible, though that is a forecast.
Vision
Vision-language models are mature. GPT-4V, Claude with vision, Gemini Pro Vision can describe images, answer questions about them, perform OCR, identify objects, reason about diagrams. Capabilities approach human level for most casual tasks; specialised tasks (medical imaging, scientific imagery) still benefit from domain-specific training.
The capability spectrum. Description: "what's in this image". OCR: extract text. Reasoning: answer questions about content. Spatial: count objects, identify positions. Composition: how do parts relate. Modern vision LLMs handle all of these on clean, uncluttered inputs; counting and fine-grained spatial judgements degrade on dense or cluttered scenes.
The OCR sweet spot. Modern vision LLMs handle clean text in images near-perfectly. Handwritten or stylised text is harder; specialised OCR models (Google Document AI, AWS Textract) often beat general LLMs on noisy real-world text. For mixed text-and-context queries, vision LLMs win; for pure OCR, specialists win.
The diagram understanding. Charts, graphs, flowcharts. Vision LLMs read these capably for casual cases; complex multi-axis charts with overlapping data still confuse them. Asking "approximately what value" works; reading exact numbers off chart axes is unreliable.
The medical/scientific imagery. General vision LLMs can describe X-rays superficially. They cannot reliably do diagnostic tasks. Specialised medical imaging models (trained on labelled radiology data) significantly outperform them. For these domains, plan to train or fine-tune; don't rely on general models.
Audio
Audio models handle speech transcription, music understanding, and sound classification. Whisper popularised the speech-to-text architecture; modern audio LLMs do much more: they answer questions about audio content, describe music, and classify environments. Real-time audio (sub-300ms response) is harder than offline processing; the latency-quality trade-off is real.
The transcription state. Whisper-large-v3 and follow-ons achieve human-parity transcription on clean audio in major languages. Code-switching, accented English, and noisy environments remain challenging. Speaker diarisation (who said what) adds another error layer.
The voice-AI pattern. ASR (audio to text) → LLM (text to text) → TTS (text to audio). End-to-end audio models (no intermediate text) are emerging; they offer lower latency and better handling of non-verbal cues. Production voice AI in 2026 is mostly the three-stage pattern; end-to-end is research-grade, approaching production.
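The three-stage pattern can be sketched as three callables chained together, with latency measured per stage so the sub-300ms budget can be checked against real numbers. Everything named here (`VoiceTurn`, `run_voice_turn`, the `fake_*` stand-ins) is hypothetical; in practice each callable would wrap whichever ASR, LLM, and TTS service you use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VoiceTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes
    latency_s: Dict[str, float]  # seconds spent in each stage


def run_voice_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> VoiceTurn:
    """Run one ASR -> LLM -> TTS turn and record per-stage latency."""
    latency: Dict[str, float] = {}

    start = time.perf_counter()
    transcript = asr(audio)
    latency["asr"] = time.perf_counter() - start

    start = time.perf_counter()
    reply_text = llm(transcript)
    latency["llm"] = time.perf_counter() - start

    start = time.perf_counter()
    reply_audio = tts(reply_text)
    latency["tts"] = time.perf_counter() - start

    return VoiceTurn(transcript, reply_text, reply_audio, latency)


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_asr(audio: bytes) -> str:
    return "what time is it"


def fake_llm(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_voice_turn(b"\x00\x01", fake_asr, fake_llm, fake_tts)
print(turn.reply_text)
print({stage: round(secs, 6) for stage, secs in turn.latency_s.items()})
```

Because the stages run in sequence, their latencies add. That is the structural reason end-to-end models can respond faster: there is no intermediate text hand-off to wait on.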
The music understanding. "Describe this song", "what genre", "what instruments". General audio LLMs handle these. Music-specific models (MusicLM, MusicGen) generate music; understanding-specific models (CLAP, MuQ) handle classification. The capability is real, but quality trails specialised music tools.
The environmental sound. Classifying "is this a baby crying" vs "is this a dog barking" works well. Real-world deployments include accessibility tools, security monitoring, industrial sound detection. The capability is mature for known sound classes; novel sounds need specialised models.
Video
Video understanding is the newest multimodal capability to mature. Models can describe video, answer questions about events, identify timestamps, and recognise actions. Video generation (Sora, Veo, Runway) is advancing rapidly. Compute requirements are substantial: video means many tokens, and long video means far more.
The understanding spectrum. Short clips (30 seconds): mature. Medium clips (5 minutes): works. Long video (1+ hour): challenging, because context length and temporal reasoning are both stretched. Most production video AI processes short clips; long video uses chunk-and-summarise pipelines.
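A chunk-and-summarise pipeline can be sketched in two steps: cut the timeline into overlapping spans, describe each span, then combine the partial descriptions. The 30-second chunk and 5-second overlap are assumed defaults, and `describe_chunk` and `combine` are hypothetical callables standing in for model calls.

```python
from typing import Callable, List, Tuple


def chunk_spans(
    duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into overlapping (start, end) spans in seconds."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk length must exceed the overlap")
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        spans.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break  # this span already reaches the end of the video
        start += chunk_s - overlap_s
    return spans


def summarise_long_video(
    duration_s: float,
    describe_chunk: Callable[[float, float], str],
    combine: Callable[[List[str]], str],
) -> str:
    """Describe each span, then merge the timestamped partial descriptions."""
    partials = [
        f"[{start:.0f}s-{end:.0f}s] {describe_chunk(start, end)}"
        for start, end in chunk_spans(duration_s)
    ]
    return combine(partials)


print(chunk_spans(70, 30, 5))
print(summarise_long_video(70, lambda s, e: "clip description", "\n".join))
```

The overlap exists so that an event straddling a chunk boundary is fully visible in at least one chunk. Keeping timestamps in the partial descriptions lets the combining step answer "when did X happen" questions.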
The use cases that work. Video summarisation, content moderation, action recognition, sports analytics, accessibility (auto-generated descriptions). Each has production deployments at major operators. The technology has matured fast in 2024-2026.
The generation state. Sora-class models produce minute-long high-quality video from text prompts. Generation is compute-expensive: minutes-to-hours per video. Production use cases are pre-computed (movie pre-viz, advertising, social content) rather than real-time.
The cost structure. Video token counts are high. A 60-second video at modest resolution might tokenise to 50K-200K tokens. Inference on a 1-minute video query typically costs $0.10-$2.00. Per-query cost rules out some use cases; batch and async processing make video AI economical.
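The per-query figure is token count times input price. A small sketch; the prices used ($2 and $10 per million input tokens) are hypothetical placeholders chosen to bracket the range above, not quotes from any provider, and output tokens are ignored:

```python
def query_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input-side cost of one query at a per-million-token price."""
    return input_tokens * price_per_million_usd / 1_000_000


def monthly_cost_usd(
    queries_per_day: int, input_tokens: int, price_per_million_usd: float
) -> float:
    """Projected 30-day spend for a steady query volume."""
    return 30 * queries_per_day * query_cost_usd(input_tokens, price_per_million_usd)


low = query_cost_usd(50_000, 2.0)     # small clip, cheap tier (assumed price)
high = query_cost_usd(200_000, 10.0)  # large clip, premium tier (assumed price)
print(f"per query: ${low:.2f} to ${high:.2f}")
print(f"1,000 queries/day at the high end: ${monthly_cost_usd(1000, 200_000, 10.0):,.0f}/month")
```

Run this with your own token counts and your provider's current price sheet; the monthly projection is usually what decides between real-time and batch architectures.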
Where they fail
Specific failures to know about. Counting objects in cluttered images (often wrong). Reading clock times. Understanding diagrams with many overlapping elements. Long video without good temporal reasoning. Audio with multiple speakers in noisy environments. The long tail of multimodal tasks has many specific failure modes; build evals for the failures specific to your use case.
The counting failure. "How many people are in this image?" Wrong on dense scenes, often off by 30-50%. The model can describe density qualitatively ("many people") accurately; quantitative counts are unreliable. Use specialised counting models for dense-scene counting.
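To see whether your own data shows errors of that size, measure mean relative count error against hand-labelled ground truth. A minimal sketch; the function name and the sample counts are made up for illustration:

```python
from typing import Sequence


def mean_relative_count_error(
    predicted: Sequence[int], actual: Sequence[int]
) -> float:
    """Average of |predicted - actual| / actual over all images."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("ground-truth counts must be positive")
    errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Illustrative numbers only: one undercount, one exact, one overcount.
model_counts = [70, 12, 150]
true_counts = [100, 12, 100]
print(f"{mean_relative_count_error(model_counts, true_counts):.1%}")
```

Report the metric separately for sparse and dense scenes; an average over both hides exactly the failure this section describes.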
The clock-reading failure. Analog clocks confuse vision LLMs. Hands at 4:35 might be read as 4:30 or 5:35. Digital clocks work. The pattern reveals that fine-grained spatial reasoning is still imperfect.
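The geometry shows why 4:35 is a hard case. A short sketch computing hand angles, measured clockwise in degrees from the 12 o'clock position:

```python
def hand_angles(hour: int, minute: int) -> tuple:
    """Return (hour_hand_deg, minute_hand_deg), clockwise from 12."""
    minute_angle = minute * 6.0                       # 360 degrees / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5    # 30 per hour, plus drift
    return hour_angle, minute_angle


print(hand_angles(4, 35))  # (137.5, 210.0)
```

The "4" marker sits at 120 degrees and the "5" at 150. At 4:35 the hour hand is at 137.5 degrees, past the midpoint and visually closer to the 5, so a model that snaps the hour hand to the nearest numeral reads 5:35.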
The diagram complexity failure. Simple diagrams work. Complex multi-element diagrams with overlapping arrows, multiple groupings, and hierarchical structure overwhelm the model. Performance degrades faster than humans expect.
The "describe and reason" gap. Models describe what they see well. Reasoning about what they see (counterfactuals, causal chains, novel combinations) is harder. The gap between "can describe" and "can reason about" is where most multimodal AI products live or die.
Common antipatterns
Trusting vision LLMs for diagnostic medical imaging. Specialised models exist; general LLMs underperform. Use the specialised tool.
Real-time video understanding without infrastructure. Compute requirements are real. Latency-sensitive applications need careful architecture.
Counting precision claims. Don't ship products that depend on counting accuracy from general vision LLMs.
Skipping modality-specific evals. Generic benchmark scores mislead. Build evals for YOUR multimodal use case.
What to do this week
Three moves. (1) For one multimodal use case, build a 50-example eval set with hard cases (cluttered images, accented audio, etc.). The eval reveals model fitness for production. (2) Compute per-query cost for your typical multimodal query at current API pricing. The number guides architectural decisions. (3) If you're considering a specialised modality model (medical, music, etc.), evaluate it head-to-head with the general LLM. The capability gap is often larger than expected.
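Move (1) needs very little tooling. A minimal sketch of an eval harness that tags each example with its hard-case category and reports accuracy per tag; the file names, tags, and `fake_predict` are hypothetical placeholders for your own data and model call, and exact-match scoring is an assumption that suits short answers only:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# (input reference, expected answer, hard-case tag)
Example = Tuple[str, str, str]


def accuracy_by_tag(
    examples: Iterable[Example], predict: Callable[[str], str]
) -> Dict[str, float]:
    """Exact-match accuracy (case-insensitive), grouped by hard-case tag."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ref, expected, tag in examples:
        total[tag] += 1
        if predict(ref).strip().lower() == expected.strip().lower():
            correct[tag] += 1
    return {tag: correct[tag] / total[tag] for tag in total}


# Hypothetical examples; a real set would have around 50.
examples = [
    ("img_001.png", "3", "cluttered"),
    ("img_002.png", "12", "cluttered"),
    ("clip_001.wav", "hello", "accented"),
]


def fake_predict(ref: str) -> str:
    """Stand-in for a model call, with one deliberate miss."""
    return {"img_001.png": "3", "img_002.png": "9", "clip_001.wav": "Hello"}[ref]


print(accuracy_by_tag(examples, fake_predict))
```

Per-tag scores matter more than the overall number: a model at 90% overall and 40% on cluttered images is not fit for a product that mostly sees cluttered images.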