AI & ML Advanced By Samson Tanimawo, PhD Published Dec 26, 2026 5 min read

ML System Architecture Patterns

Beyond ‘put a model behind an API’, mature ML systems share common architectural patterns. Knowing the patterns saves you from reinventing each one badly.

Online inference

The most common pattern. Request comes in; model produces prediction; response goes out. Latency matters; high availability matters; predictable performance matters. Most production ML serving is online inference.

The serving stack. Model behind a service (Triton, TF Serving, TorchServe, vLLM for LLMs). Service behind a load balancer. Auto-scaling based on traffic. Standard web-service operational patterns apply.

The latency budget. Per-request latency budgets are tight: 100-500ms for interactive UX, 50-200ms for autocomplete. Model size, batch size, and infrastructure choices all affect latency. Profile the per-request breakdown; optimise the dominant component.

The scaling pattern. Stateless services scale horizontally. Each replica handles independent requests; load balancer distributes. Model loaded into each replica's memory. Cache-warmth and replica startup time become operational concerns.

The cost optimisation. GPU instances are expensive; utilisation matters. Batching multiple requests improves utilisation; trades some latency for throughput. Auto-scaling reduces cost during low-traffic periods. The economics of online inference depend heavily on getting these optimisations right.

Batch inference

Process many inputs at once. Schedule the job; wait for it to finish; consume the outputs. No latency requirement (within the schedule); throughput is the primary metric. Cheaper than online, bigger batches, simpler infrastructure, less HA.

The cost advantage. Spot instances can be used (jobs can resume if interrupted). Batch sizes are optimised for throughput, not latency. Total cost per inference is 5-50x lower than online inference for the same model.

The use cases. Bulk scoring (score all customers nightly), document processing (process today's uploads), training data preparation (run inference to label data for training). Anything that can be scheduled rather than served on-demand.

The infrastructure. Workflow orchestrator (Airflow, Dagster, AWS Step Functions). Parallel workers. Result writes to data warehouse or object store. Monitoring at job-completion level rather than request level.

The async pattern. Request → enqueue → worker processes → result available. Combines online's UX (request-response) with batch's economics (workers process at their own pace). Common for LLM applications where latency is multi-second anyway.

Streaming

Continuous inference on continuous data. Click stream, IoT sensors, log streams. Latency is moderate (seconds to a minute); volume is high. Kafka or similar streaming infrastructure feeds the model; predictions feed downstream systems.

The infrastructure. Stream processing framework (Kafka Streams, Flink, Spark Streaming). Model inference embedded as a stream operator. Outputs go to another stream or to storage. Stream processing primitives (windowing, joins, aggregations) compose with model inference.

The use cases. Real-time fraud detection. Real-time recommendation updates. Anomaly detection on telemetry. Real-time personalisation. Anything that benefits from acting on data as it arrives.

The state-management challenge. Streaming models often need state (last N events for this user, running statistics). State stores (RocksDB, Redis) integrated with streaming. State management is the harder operational problem; the inference is the easy part.

The latency vs accuracy. Streaming systems often must respond fast with whatever data has arrived. Late data may improve accuracy but arrives after the decision needs to be made. Design for the latency budget; accept partial-data accuracy.

RAG

Retrieval-Augmented Generation. The pattern for building LLM apps with proprietary or current information. The architecture: vector store containing your documents → retriever pulls relevant chunks for each query → LLM generates response conditioned on retrieved chunks. RAG is the dominant LLM application pattern in 2026.

The retrieval stage. Embed query and documents; find documents whose embeddings are closest to query embedding. Vector stores (Pinecone, Weaviate, Qdrant, pgvector) handle the indexing. Retrieval quality dominates RAG quality.

The generation stage. LLM gets the query plus retrieved chunks as context. Generates response grounded in retrieved content. Better retrieval → better generation; bad retrieval → confident hallucination.

The chunking question. How big should chunks be? 256 tokens? 1024? Whole documents? The answer depends on use case; 512-1024 tokens with overlap is a reasonable starting point. Iterate based on quality.

The hybrid retrieval. Pure vector similarity often misses keyword matches. Hybrid retrieval (vector + BM25) usually outperforms pure vector. Reranking with a cross-encoder further improves quality. Each step adds compute but improves results.

The eval question. RAG quality is hard to measure. Build retrieval-specific metrics (hit rate, MRR), generation-specific metrics (faithfulness, relevance), and end-to-end metrics (user task success). Without an eval suite, RAG iteration is blind.

Agent loop

The newest pattern. LLM as the orchestrator: decides what to do, calls tools, observes results, decides next step. Loop continues until task done. The architecture is recursive; the LLM is part of the control flow, not just an output node.

The structure. Prompt the LLM with: task description, available tools, history of actions and results. LLM emits next action (tool call). Runtime executes; appends result to history; calls LLM again. Loop until LLM emits "done".

The tool definition. Each tool: name, description, input schema, executor. LLM picks tool based on description; runtime validates input against schema; executes. Tool quality determines what the agent can do.

The state management. The history of actions accumulates context. Eventually exceeds the context window; agent must summarise or selectively forget. State management is non-trivial in long-running agents.

The error handling. Tools fail. Models hallucinate tool calls. The runtime must detect, recover, sometimes escalate to humans. Error handling is half the operational work; without it, agents are unreliable.

The cost reality. Each turn is an LLM call. Long agent sessions consume many tokens. Costs can be substantial; budgets and per-task caps prevent runaway scenarios.

Common antipatterns

Online inference for batch-shaped workloads. Pay online prices; get batch performance. Move to true batch.

Streaming without state management. Each event treated independently when state would help. Add state stores deliberately.

RAG without retrieval eval. Bad retrieval quality is invisible without eval. Build the eval before scaling.

Agent loops without iteration caps. Stuck loops burn money. Always cap; always log.

What to do this week

Three moves. (1) Map your top 3 ML use cases to one of these patterns. The classification surfaces architectural mismatches. (2) For RAG systems, add retrieval-specific eval if missing. The eval is what enables systematic improvement. (3) For agent systems, audit iteration caps and budgets. The first runaway loop is expensive; the prevention is cheap.