AI & ML | Advanced | By Samson Tanimawo, PhD | Published Dec 31, 2026

The 100-Post Capstone: What I’ve Learned

Writing 100 posts on AI/ML over two years taught me what holds and what bends. These are the five lessons that survived contact with reality.

Scale wins, eventually

Across 100 posts on AI/ML, this lesson appears most often. Larger models perform better on most tasks. The compute-quality curve has not bent. Predictions that scale would plateau have repeatedly been wrong; scaling laws have repeatedly held.

The empirical pattern. Each generation of frontier models is larger; each is more capable; the relationship is reliable. GPT-2 to GPT-3 to GPT-4 to whatever's next: the scaling pattern holds. The "scale won't help anymore" prediction has been wrong every year.

The qualifier. Scale wins ON AVERAGE. Specific tasks have specific scaling curves. Some saturate; some don't. The smart move: scale for tasks where the curve hasn't saturated; use efficient methods for tasks where it has.

The implication for builders. Don't bet against scaling. Plan for capabilities that don't exist today but will when the next generation ships. Position to ship features that need next-generation capability without rewriting everything.

The implication for users. Today's "best" model isn't tomorrow's. The model market is volatile; lock-in to one provider hurts you. Build with abstraction; switch easily.

Evals matter more than models

The model is the engine. The eval is the steering wheel. Most teams underinvest in evals and overinvest in model selection. Strong evals reveal which models are actually best for YOUR task; weak evals leave the decision to vendor benchmarks, which are often misleading.

The model-selection trap. Teams agonise over GPT vs Claude vs Llama based on benchmark scores. The benchmarks may not reflect their task. Without their own evals, the choice is informed by marketing material.

The eval-investment payback. A solid eval suite (50-200 examples per task type, automated scoring) takes 1-2 engineer-weeks to build. It pays back in better model selection, faster iteration, and clear regression detection. The ROI is large; few investments compound this well.
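To make "automated scoring" concrete, here is a minimal sketch of an eval harness. Everything in it is an assumption for illustration: the `EvalCase` shape, the `exact_match` scorer, and `toy_model`, which stands in for a real model call. Real suites usually need task-specific scorers; exact match is just the simplest one to show.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One example in the suite: an input and the answer we expect."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when output and expected match after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over all `cases`."""
    if not cases:
        raise ValueError("eval suite is empty")
    return sum(scorer(model(c.prompt), c.expected) for c in cases) / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2+2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("2+2?", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

if __name__ == "__main__":
    print(f"mean score: {run_eval(toy_model, CASES):.2f}")
```

Because `model` is just a callable, the same suite can score any candidate model without changing the cases.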

The "what to evaluate" reality. Evals should match what users actually do: actual production task quality, not academic benchmark performance. Build evals from production logs (anonymised), edge cases, and adversarial examples.

The continuous-evaluation discipline. Run evals on every model update, every prompt change, every architectural change. Track results over time. The discipline catches regressions before they reach production.
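One way to turn tracked results into regression detection is to compare each run against a stored baseline. The sketch below assumes scores are kept as a task-to-score mapping; the function name and the 0.02 tolerance are illustrative choices, not recommendations.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Map each regressed task to the size of its score drop.

    A task counts as regressed when its score fell by more than `tolerance`,
    or when it is missing from `current` (treated as a full drop).
    """
    regressions: dict[str, float] = {}
    for task, old_score in baseline.items():
        new_score = current.get(task)
        if new_score is None:
            regressions[task] = old_score
        elif old_score - new_score > tolerance:
            regressions[task] = round(old_score - new_score, 4)
    return regressions


if __name__ == "__main__":
    # Hypothetical scores from two runs of the same suite.
    before = {"summarise": 0.90, "extract": 0.80}
    after = {"summarise": 0.91, "extract": 0.70}
    print(find_regressions(before, after))
```

Run as a gate in CI, a non-empty result can block the model update or prompt change that caused it.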

Cost engineering compounds

AI costs grow with usage. Without cost engineering, costs grow faster than revenue. The teams that built cost discipline early (caching, routing, model tier management) have 5-10x lower spend than teams that didn't, at similar scale. The compounding is brutal.

The compounding mechanism. Each architectural choice has cost implications. Choices made early constrain later choices. The team that didn't build prompt caching in 2024 still doesn't have it in 2026 (and pays full price on every cacheable prompt). Compounding goes both ways.

The discipline-now-vs-later math. Cost engineering investment pays back in 6-12 months at any meaningful scale. Delaying is expensive: another 6-12 months of unoptimised spend. The math favours starting now.

The specific practices. Prompt caching, model routing, output limits, batch APIs for non-realtime, response caching, smaller models for easy tasks. Each is independently valuable; combined they often produce 70-90% cost reduction.
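Two of these practices, response caching and routing easy tasks to a smaller model, can be sketched together in a few lines. This is an illustration under stated assumptions: `CachedRouter`, the `is_easy` predicate, and the in-memory dict cache are all hypothetical, and a production cache would also need expiry and a key that covers model and parameters.

```python
import hashlib
from typing import Callable

ModelFn = Callable[[str], str]


class CachedRouter:
    """Serve repeated prompts from a cache; route the rest by difficulty."""

    def __init__(self, small: ModelFn, large: ModelFn, is_easy: Callable[[str], bool]):
        self.small = small
        self.large = large
        self.is_easy = is_easy
        self._cache: dict[str, str] = {}
        # Counters make the effect of caching and routing visible.
        self.calls = {"small": 0, "large": 0, "cache": 0}

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.calls["cache"] += 1
            return self._cache[key]
        tier = "small" if self.is_easy(prompt) else "large"
        model = self.small if tier == "small" else self.large
        self.calls[tier] += 1
        answer = model(prompt)
        self._cache[key] = answer
        return answer


if __name__ == "__main__":
    # Hypothetical models and a deliberately crude difficulty rule.
    router = CachedRouter(
        small=lambda p: "small:" + p,
        large=lambda p: "large:" + p,
        is_easy=lambda p: len(p) < 20,
    )
    router.complete("hi")
    router.complete("hi")
    print(router.calls)
```

The difficulty rule is the part that needs real work; prompt length is only a placeholder for whatever signal fits the task.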

The cost-aware culture. Engineers who see their feature's cost make better decisions. Per-feature cost dashboards visible to engineers; cost reviews part of feature launches. The cultural shift compounds over time.
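The data behind a per-feature cost dashboard can be as simple as usage records tagged with a feature name. The sketch below assumes that tagging exists; the `Usage` record and both prices are hypothetical placeholders, not any provider's actual rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 4.00


@dataclass(frozen=True)
class Usage:
    """Token usage for one model call, tagged with the feature that made it."""
    feature: str
    input_tokens: int
    output_tokens: int


def cost_by_feature(records: list[Usage]) -> dict[str, float]:
    """Sum the USD cost of all calls, grouped by feature."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        call_cost = (
            r.input_tokens * PRICE_PER_M_INPUT
            + r.output_tokens * PRICE_PER_M_OUTPUT
        ) / 1_000_000
        totals[r.feature] += call_cost
    return dict(totals)


if __name__ == "__main__":
    records = [
        Usage("search", 1_000_000, 0),
        Usage("search", 0, 500_000),
        Usage("chat", 2_000_000, 1_000_000),
    ]
    for feature, usd in sorted(cost_by_feature(records).items()):
        print(f"{feature}: ${usd:.2f}")
```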

Portable wins

Vendor lock-in costs more than portability investment. Building with abstractions (vendor-agnostic API; swappable models; multi-provider fallback) costs marginal engineering time but preserves enormous strategic optionality. The teams that bet on one vendor and hit a wall (price hike, model deprecation, provider outage) wish they had portability.

The abstraction principle. Define a model interface in your code. Implementations route to OpenAI, Anthropic, Google, or self-hosted. Switching is a config change, not a code rewrite.
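A minimal sketch of such an interface, assuming a single `complete` method is enough for your use case. `ChatModel`, `EchoProvider`, and the registry keys are hypothetical names; a real adapter would wrap a vendor SDK behind the same method.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Hypothetical stand-in for a vendor adapter (hosted API or self-hosted)."""

    def __init__(self, name: str):
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# One entry per implementation; the keys are illustrative.
REGISTRY: dict[str, ChatModel] = {
    "vendor_a": EchoProvider("vendor_a"),
    "self_hosted": EchoProvider("self_hosted"),
}


def get_model(config: dict[str, str]) -> ChatModel:
    """Pick the implementation named in config, so switching is a config edit."""
    name = config["provider"]
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None


if __name__ == "__main__":
    model = get_model({"provider": "vendor_a"})
    print(model.complete("hello"))
```

The design choice that matters is that nothing outside the adapters imports a vendor SDK directly.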

The cost. Marginal: perhaps 10-20% extra engineering effort upfront, plus the discipline to keep the layer abstract as features grow.

The benefit. Strategic optionality. When a vendor raises prices, you switch. When they deprecate a model, you switch. When they have an outage, you fail over. Each scenario is real; portability handles them all.
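The outage scenario can be sketched as an ordered list of providers tried in turn. The names here (`complete_with_fallback`, `AllProvidersFailed`, the provider labels) are hypothetical, and catching every `Exception` is a simplification; real code would likely retry only on timeouts and server errors.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the list raised an error."""


def complete_with_fallback(providers: Sequence[Provider], prompt: str) -> tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # simplification: treat any error as a failover trigger
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary(prompt: str) -> str:
        raise TimeoutError("simulated outage")

    def backup(prompt: str) -> str:
        return "answer to " + prompt

    print(complete_with_fallback([("primary", primary), ("backup", backup)], "hello"))
```

Returning the provider name alongside the answer keeps failovers visible in logs and cost reports.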

The litmus test. "Could we switch our primary AI provider in a week?" If yes, you have portability. If no, you're locked in. The test is uncomfortable. Run it anyway, and act on the result.

The boring work pays

Eval suites, monitoring, cost dashboards, compliance documentation, audit logs. None of these is exciting. All compound. The teams that built the boring infrastructure in 2023-2024 ship faster, more reliably, and at lower cost than teams that skipped it. The "we'll add that later" plan rarely results in adding it.

The eval-suite case. Boring; foundational. Without it, model decisions are educated guessing. With it, decisions are backed by data.

The monitoring case. Production AI is a complex system. Failures happen. Monitoring catches them; debugging requires logs. Build them in.

The cost-dashboard case. You can't optimise what you can't see. Cost dashboards make optimisation possible. The dashboard is one engineer-week; the savings recur monthly.

The compliance case. The EU AI Act and similar regulations require documentation. Building it as you go is much cheaper than retrofitting after a regulator inquiry.

The audit-log case. Decisions made by AI need to be auditable. Logs enable post-incident review and compliance. Without logs, post-incident analysis is impossible.
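An audit log can start as one JSON line per AI-made decision. The sketch below is an assumption-heavy illustration: the field names are hypothetical, and storing a hash of the prompt instead of the raw text is one possible privacy choice, not a compliance requirement I can vouch for.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import TextIO


def write_audit_record(
    sink: TextIO, *, model: str, prompt: str, output: str, decision: str
) -> dict:
    """Append one JSON line describing an AI-made decision and return the record."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # Assumption: hash the prompt so the log does not hold raw user text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output": output,
        "decision": decision,
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record


if __name__ == "__main__":
    import sys

    write_audit_record(
        sys.stdout,
        model="example-model",  # hypothetical model name
        prompt="Should this refund be approved?",
        output="Approve: within policy window.",
        decision="approved",
    )
```

One line per record keeps the log appendable and easy to filter during post-incident review.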

The pattern. Boring infrastructure investments compound. Exciting feature work depreciates. The teams that looked "less impressive" in 2024 are the teams that win in 2026 because their boring infrastructure handles the load.

Common antipatterns

Optimising for the demo, not the system. Demos are short; systems are long. Build for the long horizon.

Skipping evals "until the model stabilises". Models won't stabilise. Build evals now; iterate alongside.

Single-vendor lock-in for "simplicity". Simple now; expensive later. Pay the abstraction tax upfront.

Treating compliance as a future problem. Compliance debt accumulates fast. Build it in from day one.

What to do this week

Three moves. (1) Pick the most boring infrastructure investment you've been delaying (eval suite, cost dashboard, audit logs). Start it. (2) Audit your vendor abstraction. If switching providers takes a month, build the abstraction layer now. (3) Document one cost-engineering practice you'll adopt this quarter. The discipline starts with one practice; it compounds.

This is post 100 of the AI/ML series. Thanks for reading. The lessons above are the through-line of what I've learned across 100 posts. They aren't surprising; they're durable. Build for them.