Inference Optimization: vLLM, TGI, and TensorRT-LLM
Self-hosting an LLM in 2025 means picking an inference server, and the choice can change throughput by 5-10x. Here is what each option does, and where each wins.
Why inference servers exist
Running a transformer with vanilla PyTorch is fine for development. For production it’s 5-10x slower than necessary. The bottleneck isn’t compute: during autoregressive decoding, modern GPUs spend most of their time waiting on memory.
Inference servers exploit three optimisations: continuous batching (new requests join the running batch as others finish, instead of waiting for a whole batch to drain), paged attention (the KV cache is stored in fixed-size blocks, like virtual-memory pages, so variable-length sequences don’t fragment GPU memory), and speculative decoding (a small draft model proposes several tokens, which the large model verifies in a single cheap forward pass).
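The paged-attention idea is easy to see in miniature. Here is a toy sketch of a block-based KV-cache allocator; the class and method names are illustrative only, not vLLM’s actual internals:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM uses a similar small fixed size)

class PagedKVCache:
    """Toy allocator: each sequence maps logical token positions to
    physical cache blocks via a per-sequence block table."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))          # pool of physical blocks
        self.block_tables: dict[str, list[int]] = {}        # seq id -> physical blocks

    def append_token(self, seq_id: str, pos: int) -> None:
        """Reserve a fresh block whenever a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool, instantly reusable."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):                 # generate a 40-token sequence
    cache.append_token("req-1", pos)
print(len(cache.block_tables["req-1"]))   # 3 blocks: ceil(40 / 16)
cache.free("req-1")
print(len(cache.free_blocks))             # all 8 blocks free again
```

The payoff is that memory is reserved in small blocks on demand rather than as one contiguous slab per request, which is what lets a server pack many in-flight sequences onto one GPU.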
The three options that matter in production: vLLM (UC Berkeley), TGI (Hugging Face), and TensorRT-LLM (NVIDIA).
vLLM
Open-source, fast, broad model support. Introduced the PagedAttention technique that pushed the whole field forward. Easy to deploy with Docker.
Strengths: high throughput on most workloads, good model compatibility (Llama, Mistral, Qwen, Phi, almost anything on Hugging Face), strong community, OpenAI-compatible API.
Weaknesses: occasional ergonomic rough edges, fewer enterprise features than TGI.
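A typical vLLM deployment is a single Docker command plus an OpenAI-compatible endpoint. The model below is just an example; substitute whatever you’re serving:

```shell
# Serve a model with vLLM's official image (OpenAI-compatible API on port 8000)
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2

# Query it like any OpenAI-style endpoint
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2",
         "prompt": "Hello", "max_tokens": 32}'
```

Because the API is OpenAI-compatible, existing client SDKs usually work by just changing the base URL.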
TGI (Text Generation Inference)
Hugging Face’s production inference server. Closely tied to the Hugging Face ecosystem. Strong on long-running multi-tenant deployments.
Strengths: very mature, excellent monitoring/metrics out of the box, strong support for guardrails and structured generation, good documentation.
Weaknesses: slower than vLLM on some workloads (though the gap is closing), and its licence has shifted over time (Apache 2.0, then the restrictive HFOIL, back to Apache 2.0 as of v2.0), which some teams found disruptive.
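TGI deployment follows the same Docker pattern; the model below is an example, and flags vary by release:

```shell
# Serve a model with TGI's official image (HTTP API on port 8080)
docker run --gpus all -p 8080:80 -v "$PWD/data:/data" \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mistral-7B-Instruct-v0.2

# TGI's native generate endpoint
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'
```

The out-of-the-box observability mentioned above comes largely from the Prometheus metrics TGI exposes on its `/metrics` endpoint, which plugs straight into standard dashboards.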
TensorRT-LLM
NVIDIA’s inference framework. Compiles models to highly-optimised CUDA kernels.
Strengths: highest absolute throughput on NVIDIA GPUs, especially Hopper (H100, H200) and Blackwell; often 20-50% faster than vLLM on the best-supported models.
Weaknesses: NVIDIA-only, complex compilation pipeline (each model needs a build step), narrower model support, steeper operational learning curve.
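The build step mentioned above looks roughly like the sketch below: convert the Hugging Face checkpoint, then compile an engine. Paths and flags here are illustrative and vary by model and TensorRT-LLM release, so treat this as a shape, not a recipe:

```shell
# 1. Convert the HF checkpoint to TensorRT-LLM's format
#    (convert_checkpoint.py lives under the per-model example directory)
python convert_checkpoint.py --model_dir ./llama-3-8b \
    --output_dir ./ckpt --dtype float16

# 2. Compile the optimised engine for your specific GPU
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine
```

The compiled engine is tied to the GPU architecture it was built for, which is part of the operational overhead: new model, new GPU, or new framework version generally means a rebuild.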
Side-by-side
| Aspect | vLLM | TGI | TensorRT-LLM |
|---|---|---|---|
| Throughput | High | High | Highest |
| Setup | Easy | Easy | Complex |
| Model coverage | Broad | Broad | Narrower |
| Hardware | CUDA + ROCm | CUDA + Inferentia | CUDA only |
Default picks
- Most teams, most models: vLLM. Fast, easy, broad coverage, free.
- Mature production with multiple tenants and dashboards: TGI.
- Maximum throughput on NVIDIA hardware, willing to invest in build pipelines: TensorRT-LLM.
- Cost-bound, large scale: TensorRT-LLM after the team has time to learn it.
If you’re unsure, deploy vLLM and revisit only when you can measure that you’re bottlenecked on inference (not on retrieval, not on the rest of your app). Premature optimisation is a real risk here.