ML System Architecture Patterns
Beyond ‘put a model behind an API’, mature ML systems share a handful of recurring architectural patterns. Knowing them saves you from reinventing each one badly.
Online inference
Request-response. Synchronous. Latency < 1s. Used for chat, search, recommendations. Bottleneck: model serving throughput. Solutions: response caching, routing easy queries to smaller models.
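A minimal sketch of that serving path, combining a response cache with difficulty-based routing. The model names, the word-count heuristic, and the fake response string are all illustrative placeholders, not a real serving stack:

```python
import hashlib

# Cache keyed by a hash of the query; a hit skips the model entirely.
CACHE: dict[str, str] = {}

def difficulty(query: str) -> str:
    # Hypothetical heuristic: short queries go to the cheaper model.
    return "easy" if len(query.split()) < 8 else "hard"

def serve(query: str) -> str:
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in CACHE:  # cache hit: zero model-serving cost
        return CACHE[key]
    model = "small-model" if difficulty(query) == "easy" else "large-model"
    response = f"[{model}] answer to: {query}"  # stand-in for a real model call
    CACHE[key] = response
    return response
```

The cache attacks throughput directly (repeat queries cost nothing); routing attacks it indirectly by keeping the big model's queue short.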
Batch inference
Process many inputs together, no latency constraint. Used for nightly scoring, document processing, embedding generation. Bottleneck: total compute cost. Solutions: spot instances, larger batches, batch APIs at provider level.
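A sketch of the batching side, assuming the per-request overhead dominates and larger batches amortize it. `score_batch` is a stand-in for one bulk model call, not a real API:

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    # Chunk the input so each model call carries many documents.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def score_batch(batch: list[str]) -> list[float]:
    # Stand-in for a single bulk model call; one request per batch
    # amortizes connection and dispatch overhead across its items.
    return [float(len(doc)) for doc in batch]

def nightly_score(docs: list[str], batch_size: int = 256) -> list[float]:
    scores: list[float] = []
    for batch in batched(docs, batch_size):
        scores.extend(score_batch(batch))
    return scores
```

With no latency constraint, `batch_size` is tuned purely for cost, and the whole job can run on preemptible capacity since a killed batch can simply be rescored.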
Streaming inference
Process events as they arrive. Used for fraud detection, anomaly detection, live recommendations. Bottleneck: latency budget per event. Solutions: feature stores, lightweight models, async pipelines.
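A sketch of the per-event hot path for fraud scoring, assuming features are precomputed into a feature store so the event handler only does a lookup plus a lightweight score. The feature names, threshold, and rule-based scorer are placeholders for a real small model:

```python
# Precomputed features, refreshed out-of-band; the hot path only reads.
FEATURE_STORE = {"user_1": {"avg_amount": 40.0}}

def score_event(event: dict) -> bool:
    # Feature lookup keeps the per-event work to O(1); no heavy
    # feature computation happens inside the latency budget.
    features = FEATURE_STORE.get(event["user_id"], {})
    # Lightweight stand-in for a small model: amount vs. historical average.
    risk = event["amount"] / max(features.get("avg_amount", 1.0), 1.0)
    return risk > 5.0  # hypothetical fraud threshold

def process(events: list[dict]) -> list[str]:
    return [e["id"] for e in events if score_event(e)]
```

The expensive work (aggregating transaction history into `avg_amount`) moved out of the event path into the feature store's write path, which is the whole point of the pattern.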
RAG
Retrieval + generation. Online inference with an extra retrieval hop. Bottleneck: retrieval quality. Solutions: hybrid retrieval, reranking, structured prompting.
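The retrieval hop can be sketched as below. Term-overlap scoring stands in for real hybrid (lexical + vector) retrieval, and the document corpus is invented; the structured prompt at the end is the part that feeds generation:

```python
# Toy corpus; a real system would query a lexical index plus a vector store.
DOCS = {
    "d1": "feature stores cache precomputed features",
    "d2": "rerankers reorder retrieved candidates by relevance",
    "d3": "spot instances cut batch compute cost",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    # Stand-in for hybrid retrieval: rank documents by term overlap.
    terms = set(query.lower().split())
    ranked = sorted(DOCS.items(),
                    key=lambda kv: len(terms & set(kv[1].split())),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    # Structured prompting: context first, clearly separated from the question.
    context = "\n".join(DOCS[d] for d in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Reranking would slot in between `retrieve` and `build_prompt`: over-fetch candidates cheaply, then reorder the top handful with a more expensive model.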
Agent loop
Plan + tool-call + observe + iterate. Long-running. Bottleneck: cost ceiling per task. Solutions: iteration limits, durable workspace state, multi-agent decomposition.
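The loop shape, with the cost controls from above, can be sketched as follows. The planner, tool registry, and workspace dict are assumptions about the interface, not a specific framework's API:

```python
from typing import Callable, Optional

def run_agent(task: str,
              tools: dict[str, Callable],
              planner: Callable[[dict], Optional[dict]],
              max_iters: int = 5) -> dict:
    # Durable workspace state: survives across iterations so progress
    # is never lost between tool calls.
    state = {"task": task, "observations": []}
    for _ in range(max_iters):  # hard iteration cap bounds cost per task
        action = planner(state)  # plan: decide the next tool call
        if action is None:       # planner signals the task is done
            break
        result = tools[action["tool"]](**action["args"])  # tool-call
        state["observations"].append(result)              # observe
    return state  # returned even at the cap, so partial work is kept
```

Multi-agent decomposition is the same loop one level up: a parent agent whose "tools" are child `run_agent` calls, each with its own iteration budget.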