By Samson Tanimawo, PhD · Published Aug 26, 2025

Inference Optimization: vLLM, TGI, and TensorRT

Self-hosting an LLM in 2025 means picking an inference server. The choice changes throughput by 5-10x. Here is what each does, and where each wins.

Why inference servers exist

Running a transformer with vanilla PyTorch is fine for development. In production it’s 5-10x slower than necessary. The bottleneck isn’t compute: during autoregressive decoding, modern GPUs spend most of their time waiting on memory.

Inference servers exploit three optimisations: continuous batching (requests from different users join and leave the batch mid-flight, instead of waiting for the slowest sequence to finish), paged attention (the KV cache is stored in fixed-size blocks, like virtual-memory pages, so memory isn’t over-allocated per sequence), and speculative decoding (a cheap draft model proposes several tokens, which the main model verifies in a single pass).
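The paged-attention idea is easiest to see in miniature. Below is a toy block allocator, a sketch only: the names and block size are illustrative, not vLLM’s actual API. The point is that a sequence holds a table of small blocks rather than one contiguous buffer sized for the worst case.

```python
# Toy sketch of paged KV-cache allocation: memory is carved into fixed-size
# blocks, and each sequence keeps a block table (like a page table) instead
# of one big pre-allocated buffer. All names here are illustrative.
BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, pos: int) -> None:
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # first token of a fresh block
            table.append(self.free_blocks.pop())

    def release(self, seq_id: int) -> None:
        """A finished sequence returns its blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):        # a 40-token sequence needs ceil(40/16) = 3 blocks
    cache.append_token(0, pos)
print(len(cache.block_tables[0]))  # 3
```

Because waste is bounded by one partial block per sequence, the server can pack far more concurrent sequences into the same GPU memory, which is what makes continuous batching pay off.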

The three options that matter in production: vLLM (UC Berkeley), TGI (Hugging Face), and TensorRT-LLM (NVIDIA).
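Before comparing the servers, the speculative-decoding loop described above can be sketched in a few lines. The “models” here are toy next-token functions over integers, and the greedy agreement check stands in for the full rejection-sampling rule, but the control flow (draft k tokens cheaply, verify them in one expensive pass, keep the longest agreeing prefix) is the same.

```python
# Sketch of speculative decoding with greedy verification. Toy models only.
def target_next(seq: list[int]) -> int:
    return seq[-1] + 1                      # the expensive, correct model

def draft_next(seq: list[int]) -> int:
    # cheap model: right most of the time, wrong after multiples of 3
    return seq[-1] + (2 if seq[-1] % 3 == 0 else 1)

def speculative_decode(prompt: list[int], n_new: int, k: int = 4) -> list[int]:
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1. the draft model proposes k tokens autoregressively (cheap, sequential)
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. the target checks all k positions in one (conceptually parallel)
        #    pass, keeping the agreeing prefix plus one corrected token
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            accepted.append(expected)
            if tok != expected:
                break                        # first disagreement ends the round
        seq += accepted
    return seq[:len(prompt) + n_new]

print(speculative_decode([0], 6))  # [0, 1, 2, 3, 4, 5, 6]
```

Every round commits at least one target-verified token, so output quality is unchanged; the speedup comes from the rounds where the draft agrees for several tokens in a row.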

vLLM

Open-source, fast, broad model support. Created the “paged attention” technique that drove the field forward. Easy to deploy with Docker.

Strengths: high throughput on most workloads, good model compatibility (Llama, Mistral, Qwen, Phi, almost anything on Hugging Face), strong community, OpenAI-compatible API.

Weaknesses: occasional ergonomic rough edges, fewer enterprise features than TGI.
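Because vLLM’s server speaks the OpenAI chat-completions wire format, any OpenAI client can talk to it. A minimal stdlib sketch, assuming a locally served model; the model name and port are assumptions, not defaults you can rely on:

```python
import json
from urllib import request

# Request against vLLM's OpenAI-compatible endpoint. The model name and
# localhost:8000 are assumptions for a locally started server.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # whatever you served
    "messages": [{"role": "user", "content": "Explain paged attention in one line."}],
    "max_tokens": 64,
    "temperature": 0.2,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment once the server is running
```

The practical upshot: swapping a hosted OpenAI endpoint for self-hosted vLLM is usually a one-line base-URL change in your client.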

TGI (Text Generation Inference)

Hugging Face’s production inference server. Closely tied to the Hugging Face ecosystem. Strong on long-running multi-tenant deployments.

Strengths: very mature, excellent monitoring/metrics out of the box, strong support for guardrails and structured generation, good documentation.

Weaknesses: slower than vLLM on some workloads (though the gap is closing), and its licensing has churned: TGI moved to the restrictive HFOIL licence in 2023 before returning to Apache 2.0 in 2024.
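TGI’s native endpoint is a POST to /generate with an `inputs` string and a `parameters` object (it also exposes an OpenAI-compatible chat route). A stdlib sketch, assuming a local `text-generation-launcher` instance; the port and prompt are assumptions:

```python
import json
from urllib import request

# Request against TGI's native /generate endpoint; localhost:8080 is an
# assumption for a locally launched text-generation-launcher instance.
payload = {
    "inputs": "What is continuous batching?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.2},
}
req = request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment once TGI is up
```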

TensorRT-LLM

NVIDIA’s inference framework. Compiles models to highly-optimised CUDA kernels.

Strengths: highest absolute throughput on NVIDIA GPUs, especially Hopper (H100, H200) and Blackwell. 20-50% faster than vLLM on the best-supported models.

Weaknesses: NVIDIA-only, complex compilation pipeline (each model needs a build step), narrower model support, steeper operational learning curve.

Side-by-side

| Aspect         | vLLM        | TGI               | TensorRT-LLM |
| -------------- | ----------- | ----------------- | ------------ |
| Throughput     | High        | High              | Highest      |
| Setup          | Easy        | Easy              | Complex      |
| Model coverage | Broad       | Broad             | Narrower     |
| Hardware       | CUDA + ROCm | CUDA + Inferentia | CUDA only    |

Default picks

If you’re unsure, deploy vLLM and revisit only when you can measure that you’re bottlenecked on inference (not on retrieval, not on the rest of your app). Premature optimisation here is real.
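Confirming the bottleneck takes one timing pass over your pipeline. A minimal sketch, with stand-in stages (the `sleep` calls and stage names are placeholders for your real retrieval and generation calls):

```python
import time

def timed(fn, *args):
    """Run a pipeline stage and return (result, seconds elapsed)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Stand-ins for real pipeline stages; replace with your own calls.
def retrieve(query):  time.sleep(0.03); return ["doc"]
def generate(prompt): time.sleep(0.01); return "answer"

_, t_retrieval = timed(retrieve, "q")
_, t_inference = timed(generate, "q")
total = t_retrieval + t_inference
print(f"retrieval {t_retrieval / total:.0%}, inference {t_inference / total:.0%}")
# If inference is a minority of end-to-end latency, a faster server won't help much.
```

In this toy breakdown, retrieval dominates: even a server that made inference free would cut total latency by only a quarter. That is the measurement to take before investing in a TensorRT-LLM build pipeline.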