GPU Economics: H100 vs H200 vs MI300
The hardware decision drives every other ML cost line. Here is what each chip actually delivers, where AMD’s MI300 fits, and the cost-per-token math you should run.
H100
The 2023 workhorse, still the bulk of frontier training fleets in 2026. 80GB HBM3, ~3.35 TB/s memory bandwidth (SXM), ~990 TFLOPS dense FP16 on tensor cores (~1,980 with sparsity). Cost ~$25-30k per card; cloud rental $2-4/hour at scale. The H100 set the modern training standard; later cards iterate on it.
Where the H100 still wins. Pre-training at moderate scale. Cost-per-FLOP at the volume tier (after substantial fleet investment). Available capacity: the cloud market for H100 is liquid, while the newest generation still has queues.
The supply situation. By 2026, H100 supply has caught up with demand. The multi-week queues of 2023-2024 are gone, and spot-market H100 hours are plentiful. For teams with flexible scheduling, H100 spot is often the best price-performance available.
The depreciation story. H100s purchased in 2023 have largely amortised, so their inference cost-per-token is now favourable. Operators with H100 fleets can offer competitive inference pricing because the capital cost is mostly behind them. The economic effect is durable; H100s will be cost-competitive for years.
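The amortisation argument can be made concrete with a small sketch. It assumes straight-line depreciation and ignores power, hosting, and networking; the four-year life and 70% utilisation are illustrative assumptions, and the purchase price is the midpoint of the range quoted earlier in this section.

```python
HOURS_PER_YEAR = 24 * 365


def hourly_capital_cost(purchase_price_usd, lifetime_years, utilisation):
    """Straight-line capital cost per utilised GPU-hour.

    utilisation is the fraction of wall-clock hours doing paid work (0-1].
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    utilised_hours = lifetime_years * HOURS_PER_YEAR * utilisation
    return purchase_price_usd / utilised_hours


# Illustrative inputs only: $27.5k card, 4-year life, 70% utilisation.
capital_cost = hourly_capital_cost(27_500, 4, 0.7)
print(f"capital cost per utilised hour: ${capital_cost:.2f}")
```

Once the assumed lifetime has elapsed, this term drops to zero and only operating costs remain, which is the effect the paragraph above describes.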
H200
An H100 with more memory (141GB HBM3e vs 80GB) and faster memory bandwidth (~4.8 TB/s vs ~3.35 TB/s). Same compute as H100. Designed for memory-bound inference workloads: large models running batched inference. Cost premium ~30-40% over H100 for ~76% more memory capacity and ~43% more memory bandwidth.
The inference advantage. LLM decoding is typically memory-bandwidth-bound, not compute-bound: the model weights stream through memory for every generated token, so faster memory means more tokens per second. For 70B+ models with batched serving, H200 throughput is 1.6-1.8x the H100 at ~1.4x the cost, a clear win. The gain exceeds the raw bandwidth ratio largely because the extra memory capacity allows larger batches.
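A rough way to see the bandwidth argument is a ceiling estimate: if every decode step must read all weights once, steps per second cannot exceed bandwidth divided by weight bytes. The sketch below is a simplification under that assumption (it ignores KV-cache reads, kernel overheads, and compute limits) and uses the published peak bandwidths of about 3.35 TB/s for H100 SXM and 4.8 TB/s for H200. The function name is illustrative.

```python
def decode_tokens_per_second_ceiling(params_billion, bytes_per_param,
                                     bandwidth_tb_s, batch_size=1):
    """Upper bound on decode tokens/s when weight streaming is the bottleneck.

    Each decode step reads all weights once and emits one token per
    sequence in the batch.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    steps_per_second = bandwidth_tb_s * 1e12 / weight_bytes
    return steps_per_second * batch_size


# 70B model with 1-byte (FP8) weights, single sequence.
h100 = decode_tokens_per_second_ceiling(70, 1, 3.35)
h200 = decode_tokens_per_second_ceiling(70, 1, 4.8)
print(f"H100 ceiling: {h100:.1f} tok/s, H200 ceiling: {h200:.1f} tok/s")
print(f"ratio: {h200 / h100:.2f}x")
```

With these inputs the ratio equals the bandwidth ratio, about 1.43x. The sketch deliberately ignores batch-size limits set by memory capacity, so it does not by itself reproduce the 1.6-1.8x figure.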
The training case. For training, H200 is marginal over H100. Training is more compute-bound than inference; the memory bandwidth boost helps less. Training fleets typically don't refresh from H100 to H200 unless the larger memory unblocks specific workloads.
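The "fits a bigger model" advantage. 141GB of HBM lets a single GPU hold more parameters before tensor parallelism is needed. A 70B model in FP16 needs ~140GB for weights alone, which technically fits in 141GB but leaves almost nothing for the KV cache and activations. In practice, single-card serving of a 70B model on the H200 means FP8 weights (~70GB), which leaves ample room for batching; the H100 requires sharding at FP16 and is tight even at FP8. Single-card serving simplifies the stack substantially.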
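The capacity arithmetic is worth checking for your own model before assuming a single-card fit. The sketch below counts weights only; the KV cache, activations, and framework overhead all come out of whatever headroom remains. The function names are illustrative.

```python
def weight_footprint_gb(params_billion, bytes_per_param):
    """Weights-only memory in GB (1e9 params x bytes, divided by 1e9 bytes/GB)."""
    return params_billion * bytes_per_param


def headroom_gb(hbm_gb, params_billion, bytes_per_param):
    """HBM left after loading weights; negative means the weights do not fit."""
    return hbm_gb - weight_footprint_gb(params_billion, bytes_per_param)


for card, hbm in [("H100", 80), ("H200", 141)]:
    for precision, nbytes in [("FP16", 2), ("FP8", 1)]:
        left = headroom_gb(hbm, 70, nbytes)
        print(f"{card} 70B {precision}: {left:+.0f} GB after weights")
```

A headroom of around 1GB, as for 70B FP16 on a 141GB card, is not a workable serving configuration once any KV cache is allocated.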
The supply situation in 2026. H200 supply has mostly caught up. Multi-month queues from 2024 are gone. Cloud rental rates have stabilised at the 30-40% premium over H100.
MI300X
AMD's flagship. 192GB HBM3, ~5.3 TB/s memory bandwidth, ~1,300 TFLOPS dense FP16. On-paper compute is on par with or slightly ahead of H100/H200; memory is significantly larger. The software stack (ROCm) has matured but still trails CUDA's polish. For memory-bound inference and large-context-window workloads, MI300X is competitive on price-performance; ROCm friction is the trade-off.
The memory advantage. 192GB is more than any H-series card. Models that need >141GB on a single card prefer MI300X. Long context windows (1M+ tokens) consume memory through the KV cache, which grows linearly with context length; MI300X handles them with less tensor parallelism.
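To size the long-context claim, the standard KV-cache estimate is two tensors (keys and values) per layer, per KV head, per token. The sketch below assumes a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128) and 2-byte cache entries; that shape is an assumption for illustration, not something taken from this article.

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size in GB for `tokens` cached positions in one sequence."""
    # Factor of 2 covers the separate key and value tensors.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


# Assumed 70B-class shape: 80 layers, 8 KV heads, head_dim 128.
for context in (131_072, 1_000_000):
    size = kv_cache_gb(context, layers=80, kv_heads=8, head_dim=128)
    print(f"{context:>9,} tokens -> {size:.1f} GB of KV cache")
```

Under these assumptions a single 1M-token sequence needs roughly 328GB of cache, more than any single card discussed here, so the larger card reduces the number of shards rather than eliminating sharding.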
The price advantage. MI300X is typically priced 10-20% below H200 per card. In the cloud, AMD instances are 15-25% cheaper than the equivalent NVIDIA instances. The price advantage compounds at fleet scale; teams running 1000+ cards see meaningful savings.
The software cost. ROCm has improved substantially but still has rough edges. Custom kernels written for CUDA need adaptation. Pre-built frameworks (PyTorch, vLLM) support ROCm but with occasional bugs. Plan for 1-2 engineer-months of integration work to switch a stack from CUDA to ROCm.
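The integration estimate can be set against the cloud discount quoted above to get a payback period. This is a back-of-envelope sketch: the engineer cost, fleet size, and hourly rate are hypothetical inputs, and it assumes throughput per card is unchanged after the port.

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def rocm_port_payback_months(engineer_months, cost_per_engineer_month_usd,
                             gpu_count, nvidia_rate_usd_per_hour, amd_discount):
    """Months of rental savings needed to recover a one-off porting cost.

    amd_discount is the fractional saving versus the NVIDIA rate (0.15 = 15%).
    """
    porting_cost = engineer_months * cost_per_engineer_month_usd
    monthly_savings = (gpu_count * nvidia_rate_usd_per_hour
                       * HOURS_PER_MONTH * amd_discount)
    if monthly_savings <= 0:
        raise ValueError("no savings, so the port never pays back")
    return porting_cost / monthly_savings


# Hypothetical: 2 engineer-months at $25k each, 64 GPUs at $3.50/h, 15% discount.
months = rocm_port_payback_months(2, 25_000, 64, 3.50, 0.15)
print(f"payback in about {months:.1f} months")
```

The result is highly sensitive to fleet size: the same port on 8 GPUs takes eight times as long to pay back.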
The strategic role. AMD has positioned MI300X as "everything CUDA can do, cheaper". The proposition is real for inference; less convincing for training, where ecosystem effects favour NVIDIA. Mixed fleets (NVIDIA for training, AMD for inference) are an emerging pattern.
Blackwell
NVIDIA's 2024-2025 generation. B100, B200, GB200 (Grace Blackwell super-chip). Memory and compute step-up over Hopper; HBM3e at higher densities; FP8 throughput dramatically improved. By 2026, Blackwell is shipping in volume but supply remains tight; cloud pricing reflects the premium.
The B200 specs. 192GB HBM3e, ~8 TB/s memory bandwidth, roughly 2-2.5x the H100's dense FP8 throughput (multiples near 5x apply only when comparing Blackwell FP4 against H100 FP8). The compute jump is the biggest in years; pre-training shifts substantially when Blackwell capacity arrives.
The GB200 superchip. Two B200 GPUs integrated with a Grace CPU and NVLink fabric. Designed for large-model training where intra-rack bandwidth dominates. The GB200 NVL72 rack system holds 36 GB200 superchips (72 GPUs) in a single NVLink domain, a single-rack training cluster for 100B-1T parameter models.
The training implication. Training compute-per-dollar improves substantially with Blackwell. Frontier labs that can secure Blackwell capacity train larger models faster. The supply-constrained landscape means access to Blackwell is itself a competitive advantage.
The inference implication. For inference, Blackwell's biggest advantage is low-precision throughput: faster FP8 (which Hopper also supports) plus the new FP4 format. Running a model at FP8 with similar quality to FP16 roughly doubles peak throughput. Models that haven't been quantised don't see that gain, so quantisation work becomes urgent for Blackwell-running operators.
Cost-per-token math
For inference, the deciding number is cost-per-token: the hourly cost of the hardware divided by the tokens it actually serves in that hour. Compute (peak TFLOPS × achieved utilisation) and memory (capacity, which bounds batch size, and bandwidth, which bounds decode speed) both feed into that throughput. For most LLM inference workloads in 2026: H200 wins price-performance for 13B-70B models, MI300X wins for 70B-200B (memory advantage), Blackwell wins for >200B and any FP8-quantised workload.
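The core calculation is short enough to write down. The sketch below takes a measured throughput rather than trying to derive one from spec sheets. The hourly rates and tokens-per-second figures are hypothetical, chosen only to reflect the roughly 1.4x cost and 1.7x throughput ratios quoted for the H200 earlier.

```python
def cost_per_million_tokens(hourly_rate_usd, tokens_per_second):
    """Serving cost in USD per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1e6


# Hypothetical measurements for one workload on two cards.
candidates = {
    "H100": {"rate": 3.00, "tps": 2000},
    "H200": {"rate": 4.20, "tps": 3400},
}
for name, c in candidates.items():
    cost = cost_per_million_tokens(c["rate"], c["tps"])
    print(f"{name}: ${cost:.3f} per 1M tokens")
```

Use aggregate tokens per second across the whole batch at your real traffic level, not a single-stream benchmark, since utilisation is what separates the quoted rate from the effective one.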
The inference workload taxonomy. Small models (<13B): H100 or even L40S/A10G is competitive. Mid models (13B-70B): H200 sweet spot. Large models (70B-200B): MI300X for memory or Blackwell for compute. Frontier (>200B): Blackwell, with significant capacity planning.
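For capacity-planning scripts, the taxonomy above can be encoded as a simple lookup. The sketch treats each upper bound as inclusive, which is an arbitrary choice because the text leaves the 70B boundary ambiguous; the function name is illustrative.

```python
def suggested_inference_hardware(params_billion):
    """Map model size (billions of parameters) to the starting point named above."""
    if params_billion < 13:
        return "H100 or L40S/A10G"
    if params_billion <= 70:
        return "H200"
    if params_billion <= 200:
        return "MI300X or Blackwell"
    return "Blackwell"


for size in (7, 34, 120, 400):
    print(f"{size}B -> {suggested_inference_hardware(size)}")
```

Treat the result as a first candidate to benchmark, not a decision; traffic shape and quantisation can move a workload across these boundaries.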
The training workload taxonomy. Pre-training: Blackwell when available, H100 spot otherwise. Fine-tuning: H100 dominates because supply is plentiful and the workload doesn't need Blackwell's compute. RLHF: H100 spot for cost; tight schedules push to H200.
The fleet-mix strategy. Most production fleets in 2026 are mixed: H100 for cost-sensitive inference, H200 or MI300X for high-throughput inference, Blackwell for training when available. The mix optimises capital efficiency.
The locked-in cost. Once you've built CUDA-only tooling, switching to ROCm has friction. Teams locked in to NVIDIA pay the NVIDIA premium until they invest in ROCm support. The lock-in is partial (major frameworks support both) but real for teams with custom kernels.
Common antipatterns
Buying for peak compute spec, ignoring memory. Most modern workloads are memory-bound. Memory bandwidth and capacity matter more than peak FLOPS for inference.
Refusing AMD on principle. ROCm has matured; price-performance on MI300X is real. Ruling it out trades money for ideology.
Hand-wringing over Blackwell supply instead of using H100/H200. H100/H200 capacity is plentiful and cost-effective. Waiting for Blackwell delays revenue; running today's workloads on today's available hardware is the right call.
Single-vendor fleets at scale. Vendor risk and pricing leverage favour mixed fleets at 1000+ card scale. Single-vendor is fine for small fleets; risky for large.
What to do this week
Three moves. (1) Compute your current cost-per-token. Knowing the number unlocks all later optimisation; without it, optimisation work is done blind. (2) For your top inference workload, model the cost-per-token on H100 vs H200 vs MI300X. The right answer depends on your model size and traffic shape, so model it rather than guessing. (3) If you're CUDA-only, evaluate the cost of a ROCm port for inference. The savings at fleet scale often pay back the porting investment within 6 months.