AI Hardware: Custom ASICs
Beyond NVIDIA: Cerebras, Groq, Tenstorrent, and the hyperscalers' in-house accelerators. The hardware diversity is real, and the cost economics matter.
Cerebras
Wafer-scale chips with massive on-die memory. Inference throughput of 1,000+ tok/s on Llama 70B-class models. Particularly strong on workloads that fit within its on-chip memory hierarchy. Pricing is competitive with H100s on a $/token basis.
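The $/token math is simple: hourly cost divided by tokens generated per hour. A back-of-envelope sketch, where every price and throughput figure is an illustrative assumption, not a quoted rate:

```python
# Back-of-envelope $/token comparison: hourly cost divided by tokens
# generated per hour. All numbers below are illustrative assumptions,
# not vendor-quoted prices or benchmarks.

def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per 1M output tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical figures for a Llama-70B-class model:
scenarios = {
    "H100 node (8x, batched)": (60.0, 2500.0),    # assumed $/hr, aggregate tok/s
    "Wafer-scale system":      (200.0, 10000.0),  # assumed $/hr, aggregate tok/s
}

for name, (usd_hr, tok_s) in scenarios.items():
    print(f"{name}: ${cost_per_million_tokens(usd_hr, tok_s):.2f} per 1M tokens")
```

The point of the sketch is the shape of the comparison, not the outputs: a higher sticker price can still win on $/token if throughput scales faster than cost.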
Groq
Custom LPU (Language Processing Unit). Deterministic execution, extremely low latency, inference-only. Sub-100ms time-to-first-token across most LLM sizes. Niche, but strong for latency-critical applications.
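Groq exposes an OpenAI-compatible API, so measuring time-to-first-token takes only a few lines. A minimal probe; the base URL and model id below are assumptions to verify against current docs:

```python
# Minimal time-to-first-token probe against an OpenAI-compatible
# streaming endpoint. Base URL and model id are assumptions; check
# the provider's current documentation.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # assumed endpoint
    api_key=os.environ["GROQ_API_KEY"],
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-70b-versatile",  # example model id, may change
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start
        print(f"time to first token: {ttft * 1000:.0f} ms")
        break
```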
Tenstorrent
RISC-V-based with an open architecture; the software stack is improving. The cost-leader play. Adoption is growing among cloud providers looking for NVIDIA alternatives.
AWS Inferentia / Trainium
Amazon’s in-house chips: Trainium for training, Inferentia for inference. Strong cost-per-token if you’re committed to AWS. The software story (Neuron SDK) is decent.
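The commitment shows up at compile time: models are ahead-of-time compiled for NeuronCores. A minimal torch-neuronx sketch with a placeholder model; real LLM serving typically goes through higher-level Neuron tooling:

```python
# Sketch of compiling a PyTorch model for Inferentia with torch-neuronx.
# The model and example input are placeholders, not a real workload.
import torch
import torch_neuronx

model = torch.nn.Sequential(  # stand-in for a real model
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).eval()

example_input = torch.rand(1, 512)

# Ahead-of-time compile for NeuronCores; returns a traced TorchScript module.
traced = torch_neuronx.trace(model, example_input)
traced.save("model_neuron.pt")

# Later, on an inf2 instance: torch.jit.load("model_neuron.pt")
```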
Where each wins
- Latency-critical, high-volume inference: Groq or Cerebras.
- Cost-sensitive AWS workloads: Inferentia.
- Mainstream training and inference: still NVIDIA H200/B200, with AMD's MI300 series closing the gap.