AI Hardware: Custom ASICs
Beyond NVIDIA: Cerebras, Groq, Tenstorrent, and dedicated inference accelerators. The hardware diversity is real and the cost economics matter.
Cerebras
Cerebras builds wafer-scale chips. The WSE-3 chip inside the CS-3 system packs 4 trillion transistors and 900,000 cores onto a single piece of silicon roughly the size of a dinner plate. The pitch: massive on-chip memory bandwidth, no inter-chip communication overhead, dramatic speedup on specific workloads (especially large-model inference and certain training patterns).
The architectural advantage. Wafer-scale eliminates inter-chip communication for many workloads. The 900K cores share on-die memory at terabytes-per-second bandwidth. For workloads that fit in on-chip memory, Cerebras is dramatically faster than GPU clusters.
The "fits in on-chip memory" constraint. The CS-3 has substantial but finite on-chip memory (tens of gigabytes of SRAM). Models or activations that exceed on-chip capacity require off-chip memory access, and the speed advantage diminishes. Workload fit matters: the big gains go to workloads whose working set stays on the wafer.
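The constraint above is mostly arithmetic, so a first-pass fit check is easy to sketch. The following is a rough estimate only: it assumes weights dominate the footprint, adds a flat activation margin, and uses a default capacity of 44 GB that is an assumption for illustration. Substitute the on-chip memory figure published for the system you are evaluating. All function and parameter names are hypothetical.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_chip(num_params: float, bytes_per_param: float,
                 on_chip_gb: float = 44.0,
                 activation_overhead: float = 0.2) -> bool:
    """Return True if weights plus an assumed activation margin fit on-chip.

    on_chip_gb and activation_overhead are illustrative assumptions,
    not vendor specifications.
    """
    needed_gb = weight_footprint_gb(num_params, bytes_per_param)
    needed_gb *= 1 + activation_overhead
    return needed_gb <= on_chip_gb


if __name__ == "__main__":
    # 8B parameters at 16-bit precision: 16 GB of weights plus margin.
    print(fits_on_chip(8e9, 2))
    # 70B parameters at 16-bit precision: 140 GB of weights, over the budget.
    print(fits_on_chip(70e9, 2))
```

A model that fails this check is not ruled out, since multi-system deployments and off-chip memory exist, but it falls into the regime where the speed advantage diminishes.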
The deployment shape. Cerebras is sold as systems, not chips. Single-system or cluster deployments. Substantial capital cost; available via cloud (Cerebras Inference) for trial. Production users include some research labs and specialised inference providers.
The real-world wins. Some inference benchmarks show 5-10x speedup vs comparable-cost GPU systems. Training certain model families also shows speedups. The wins are real but workload-specific; not a universal GPU replacement.
The maturity. As of 2026, Cerebras is established but niche. Software stack supports major frameworks but isn't as polished as CUDA. Production-grade for users who fit the niche; not yet a broad alternative to NVIDIA.
Groq
Groq's LPU (Language Processing Unit) is purpose-built for LLM inference. Deterministic execution; very high token-throughput on small batch sizes; latency dramatically lower than GPU. The pitch: real-time LLM applications that can't tolerate GPU's variable latency.
The deterministic-execution advantage. GPU latency varies because of warp scheduling, memory access patterns, and contention. The LPU executes deterministically: the same input takes the same time on every run. For applications where latency variability matters (real-time conversational AI, coding assistants), the determinism is valuable.
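One way to judge whether latency variability matters for your application is to measure it on your own traffic. The sketch below summarises a list of per-request latencies; `latency_jitter` and its output keys are illustrative names, and the nearest-rank percentile is a simplification chosen for brevity.

```python
import math
import statistics
from typing import Dict, List


def latency_jitter(samples_ms: List[float]) -> Dict[str, float]:
    """Summarise latency spread from per-request latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)

    def percentile(p: float) -> float:
        # Nearest-rank method: the value at rank ceil(p/100 * n), 1-indexed.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    p50 = percentile(50)
    p99 = percentile(99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "tail_ratio": p99 / p50,  # 1.0 means the tail matches the median
        "stdev_ms": statistics.pstdev(ordered),
    }
```

A tail ratio close to 1.0 indicates near-deterministic latency. A large ratio means some requests wait far longer than the median, which is the behaviour real-time applications struggle with.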
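The throughput numbers. Groq cards achieve 500-1000+ tokens/second per LLM stream. GPU equivalents deliver 30-100 tokens/second per stream on similar models. Measured against the upper end of the GPU range, that is 5-10x, and more against the lower end. The per-stream throughput at small batches is the headline.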
The deployment shape. Groq sells cards and systems and also offers cloud access (GroqCloud) for trial. Capital cost moderate; integration cost moderate. Production users are mostly inference-focused (chatbots, code completion, real-time AI products).
The "small batch wins, large batch loses" reality. At batch size 1-8, Groq dramatically beats GPU. At large batches (32+), GPU's parallelism catches up. Pick by your batch size; small-batch latency-sensitive applications fit Groq best.
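To pick by batch size, you need the crossover point for your own model. The sketch below takes measured aggregate throughput per batch size for two devices and returns the smallest batch size at which the second catches up. The function name is hypothetical, and the numbers in the usage example are made-up placeholders, not benchmark results.

```python
from typing import Dict, Optional


def crossover_batch(throughput_a: Dict[int, float],
                    throughput_b: Dict[int, float]) -> Optional[int]:
    """Smallest shared batch size where B's throughput meets or beats A's.

    Both arguments map batch size to aggregate tokens/second.
    Returns None if B never catches up at the measured batch sizes.
    """
    shared_batches = sorted(set(throughput_a) & set(throughput_b))
    for batch in shared_batches:
        if throughput_b[batch] >= throughput_a[batch]:
            return batch
    return None


if __name__ == "__main__":
    # Placeholder measurements for illustration only.
    small_batch_device = {1: 600.0, 8: 4000.0, 32: 9000.0}
    gpu = {1: 80.0, 8: 600.0, 32: 9500.0}
    print(crossover_batch(small_batch_device, gpu))
```

If your production batch size sits well below the crossover, the small-batch device is worth evaluating. If it sits above, the GPU is likely the better fit.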
The 2026 maturity. Groq is in production at several customer-facing AI companies. Software stack supports major frameworks. Not as broad-based as NVIDIA but viable for the right workloads.
Tenstorrent
Tenstorrent builds modular AI processors. Each chip has many small "Tensix" tiles, built around RISC-V cores and connected by a mesh. Open-source software stack. Pricing meaningfully below NVIDIA. The pitch: cost-effective alternative for inference and mid-scale training, with software openness as a differentiator.
The architectural shape. RISC-V cores are general-purpose; Tensix tiles are tensor-specialised. Programmable; lets specialised kernels run efficiently. Open ISA; partial open source for the software stack.
The price advantage. Tenstorrent cards typically priced 30-50% below NVIDIA equivalents. Cloud rentals correspondingly cheaper. For cost-conscious deployments where software porting effort is acceptable, the savings are real.
The software-openness pitch. Open ISA, open compiler, open tooling. Differentiates from NVIDIA's CUDA black box. Appeals to teams that value control or auditability. Real open-source community building around the platform.
The maturity gap. Software stack is improving fast but lags NVIDIA's polish. Some kernels are highly optimised; others are not. Plan for hand-tuning critical paths; the savings vs NVIDIA can pay back the effort.
The position. As of 2026, Tenstorrent is third-tier in adoption (after NVIDIA and AMD). Real production deployments at smaller operators. The trajectory is improving; whether it reaches second-tier depends on continued execution.
AWS Inferentia/Trainium
AWS's in-house ML chips. Inferentia for inference; Trainium for training. Pricing meaningfully below NVIDIA on AWS. Software stack (Neuron SDK) integrates with PyTorch and TensorFlow. The pitch: lower cost on AWS for AWS-stack customers.
The cost-on-AWS advantage. Inferentia2 instances are typically 30-50% cheaper than equivalent NVIDIA instances on AWS. For AWS-native workloads, the savings are immediate. For non-AWS workloads, they are irrelevant: these chips run only on AWS.
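A lower instance price only becomes a saving if throughput holds up, so the useful comparison is cost per token rather than cost per hour. A minimal sketch, with hypothetical function names and placeholder prices and throughputs that you would replace with your own measurements:

```python
def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens on a single instance."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def relative_saving(baseline_cost: float, alternative_cost: float) -> float:
    """Fraction saved by the alternative; negative means it costs more."""
    return 1 - alternative_cost / baseline_cost


if __name__ == "__main__":
    # Placeholder figures: the alternative is 40% cheaper per hour
    # but delivers 30% lower throughput on this hypothetical model.
    gpu = cost_per_million_tokens(hourly_price_usd=10.0, tokens_per_second=1000)
    alt = cost_per_million_tokens(hourly_price_usd=6.0, tokens_per_second=700)
    print(f"saving per token: {relative_saving(gpu, alt):.1%}")
```

In this placeholder case the 40% hourly discount shrinks to roughly 14% per token, which is why measured throughput on your own model belongs in the comparison.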
The software-stack reality. Neuron SDK supports major frameworks. Some models work near-perfectly; others need adaptation. The "compile your model for Neuron" step is non-trivial; budget engineering time.
The training-vs-inference fit. Trainium2 is competitive on training cost-per-epoch for many models. Frontier-scale training (where Blackwell would be used) has more friction. Production training and fine-tuning often work fine on Trainium.
The vendor-lock-in concern. Neuron-compiled models don't run elsewhere. Migrating off AWS means recompiling or retraining. The savings are real but the lock-in is also real; weigh the trade-off.
The 2026 adoption. Many AWS customers use Inferentia for production inference. Trainium adoption is smaller but growing. The "default for AWS-stack customers" position is largely achieved.
Where each wins
The decision framework:
- NVIDIA H100/H200/Blackwell: default for most. Software polish, ecosystem, capability.
- AMD MI300X: cost-conscious GPU alternative. ROCm has matured; price-performance is real.
- Cerebras: workloads that fit on-chip memory and benefit from no inter-chip overhead.
- Groq: latency-critical LLM inference at small batches.
- Tenstorrent: cost-conscious inference and mid-scale training where software-porting effort is acceptable.
- AWS Inferentia/Trainium: AWS-native workloads where cost savings on AWS justify the SDK integration.
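The list above can be read as a first-pass filter. The sketch below encodes only those rules of thumb. The `Workload` fields are hypothetical names invented for illustration, and the batch-size cutoff of 8 follows the small-batch range quoted in the Groq section. Treat the output as a shortlist for benchmarking, not a recommendation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    runs_on_aws: bool = False
    latency_critical_llm: bool = False
    typical_batch_size: int = 32
    fits_on_chip_memory: bool = False
    cost_sensitive: bool = False
    can_afford_porting: bool = False


def shortlist(w: Workload) -> List[str]:
    """Return candidate hardware, always starting with the NVIDIA default."""
    candidates = ["NVIDIA H100/H200/Blackwell"]
    if w.cost_sensitive:
        candidates.append("AMD MI300X")
    if w.fits_on_chip_memory:
        candidates.append("Cerebras")
    if w.latency_critical_llm and w.typical_batch_size <= 8:
        candidates.append("Groq")
    if w.cost_sensitive and w.can_afford_porting:
        candidates.append("Tenstorrent")
    if w.runs_on_aws and w.cost_sensitive:
        candidates.append("AWS Inferentia/Trainium")
    return candidates


if __name__ == "__main__":
    chatbot = Workload(latency_critical_llm=True, typical_batch_size=1)
    print(shortlist(chatbot))
```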
The "default to NVIDIA" reality. For most teams, NVIDIA is the safe default. Software polish, ecosystem, talent availability all favor NVIDIA. Deviating requires a specific reason.
The mixed-fleet pattern. Production at scale often runs mixed fleets. NVIDIA for training and complex inference. AMD or specialised for cost-sensitive inference. AWS chips for AWS-native workloads. The mix optimises capital efficiency.
The migration cost reality. Switching hardware costs engineering work: porting kernels, tuning performance, re-validating. The savings must justify the engineering. For most workloads, the math works at large scale and doesn't at small scale.
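The scale dependence is easy to see with a payback calculation. A minimal sketch, assuming a one-off migration cost and a constant fractional saving on monthly spend. The function name and all dollar figures are hypothetical.

```python
def payback_months(migration_cost_usd: float,
                   current_monthly_spend_usd: float,
                   savings_fraction: float) -> float:
    """Months until cumulative hardware savings cover the migration cost."""
    monthly_savings = current_monthly_spend_usd * savings_fraction
    if monthly_savings <= 0:
        return float("inf")  # no savings means the migration never pays back
    return migration_cost_usd / monthly_savings


if __name__ == "__main__":
    # Placeholder figures: the same $300k porting effort and 30% saving
    # applied to a large and a small monthly hardware bill.
    for spend in (1_000_000, 20_000):
        months = payback_months(300_000, spend, 0.30)
        print(f"${spend:,}/month -> payback in {months:.1f} months")
```

With these placeholder inputs the large deployment recovers the cost in about a month, while the small one needs about fifty, by which time the hardware landscape will probably have shifted again.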
The "watch this space" reality. New hardware vendors emerge regularly. Some succeed; most don't. Track the leaders; ignore the long tail until they're proven. Don't over-rotate on individual hardware bets.
Common antipatterns
Hardware lock-in by accident. Code written tightly to one vendor's primitives is hard to migrate. Use portable frameworks; abstract vendor specifics.
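One common way to abstract vendor specifics is to hide each backend behind a small interface and select it by name at startup. The sketch below uses only the standard library. `InferenceBackend`, `register_backend`, and `EchoBackend` are hypothetical names, and the echo backend is a stand-in for a real wrapper around a vendor SDK or a portable framework.

```python
from typing import Callable, Dict, Protocol


class InferenceBackend(Protocol):
    """The only surface application code is allowed to depend on."""

    def generate(self, prompt: str, max_tokens: int) -> str: ...


_BACKENDS: Dict[str, Callable[[], InferenceBackend]] = {}


def register_backend(name: str):
    """Class decorator that records a backend factory under a name."""
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


@register_backend("echo")
class EchoBackend:
    """Stand-in backend; a real one would call vendor-specific code here."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return " ".join(prompt.split()[:max_tokens])


def get_backend(name: str) -> InferenceBackend:
    """Instantiate a registered backend, failing loudly on unknown names."""
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend {name!r}; known: {sorted(_BACKENDS)}")
    return _BACKENDS[name]()


if __name__ == "__main__":
    backend = get_backend("echo")
    print(backend.generate("portable code keeps options open", max_tokens=3))
```

With this shape, moving to different hardware means adding one backend class and changing a configuration value. Vendor-specific calls stay confined to the backend that wraps them.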
Refusing alternatives on principle. "We only use NVIDIA" leaves money on the table at scale. Evaluate alternatives empirically.
Switching for marginal savings. Engineering cost of migration must be paid back in real savings. Do the math before committing.
Single-vendor strategy at large scale. Vendor risk and pricing leverage favor mixed fleets above some scale threshold.
What to do this week
Three moves. (1) Compute your current hardware spend. Below ~$1M/year, single-vendor is fine; above, evaluate alternatives. (2) For one inference workload, model the cost on alternative hardware. The math reveals migration potential. (3) Verify your code uses portable frameworks (PyTorch, JAX, ONNX) rather than vendor-specific APIs. Portability is the prerequisite for vendor flexibility.