Systems That Don't Learn, Decay

Closing the loop. Why continuous learning isn't optional—systems that don't improve, degrade.

This post is part of the Mental Models for Production AI series, which explores the mental frameworks needed to evaluate, build, operate, and improve AI-powered features—focusing on practical decision-making.

Shipping an AI feature doesn't mean you're done. The logs look clean. Latency is normal. No one's complaining yet. But without feedback loops, you're coasting on the assumptions you made at launch — and those assumptions have a shelf life.

Here's a real example. Chip Huyen documents a case of a grocery demand forecasting model that launched successfully, but within a year "demand for some items was consistently being overestimated, which caused the extra items to expire ... the model's predictions had become so bad that they could no longer use it."

The model wasn't sabotaged. Nobody changed the code. The world just changed — subtly, consistently, invisibly — until the outputs became wrong enough to matter.

A fair pushback: some systems run fine for years without continuous learning, especially when the use case is stable and the data distribution doesn't shift. That's true — but for most user-facing AI features, usage patterns, user expectations, and upstream models change enough that decay is the more common default. The question isn't whether your system will drift, but whether you'll know when it does.

The way I think about it: decay is the default state. Learning is the active choice you have to make.

Figure: AI system quality decay over time. Without active feedback loops, quality erodes silently. The most dangerous window is months 3–6: decay has started, but no one's noticed yet.

Where the Learn Layer Fits

In this series, I've been building up a mental model with four layers: Decision → Data → Execution → Learn.

Most of the energy in production AI goes into the first three. Should we build this? What data do we need? How do we architect the execution? Those are important questions. But an open-loop system — one that produces outputs but never feeds signal back in — has a fundamental design gap.

The Learn layer closes the loop.

Figure: The Learn Layer Map. The four components of the Learn layer: Feedback → Drift → Iterate → Retrain. Each feeds the next.

This isn't one step. It's four, and the order matters:

  1. Feedback — Collect signals from production usage

  2. Drift — Detect when the system's behavior or environment is changing

  3. Iterate — Make targeted improvements without full retraining

  4. Retrain — Rebuild the foundation when iteration isn't enough

You can't detect drift without feedback. You can't make a confident iterate-or-retrain decision without drift detection. The sequence is load-bearing.

In Post 8 on Build vs Run Discipline, I talked about the different muscles needed to build well and operate well. The Learn layer adds a third: the discipline to keep the system improving after it's running.

Feedback: The Input to Everything Else

There are two types of feedback signals in production AI systems, and both are imperfect in different ways.

Explicit feedback is feedback you ask for — thumbs up/down ratings, regeneration requests, corrections. It's easy to interpret: a thumbs down on a response is a clear negative signal. The problem is users rarely give it. Explicit feedback tends to be sparse and skewed toward strong reactions.

Implicit feedback is signal you collect organically from usage behavior, without asking. GitHub Copilot uses code acceptance as a positive signal. Midjourney uses image downloads and upscaling as preference indicators. ChatGPT can observe conversation length, regeneration patterns, and what users copy from responses.

Eugene Yan's LLM patterns analysis emphasizes collecting user feedback as part of building a production data flywheel. In practice, the tradeoff between these two signal types is clear: explicit feedback suffers from sparsity, while implicit feedback tends to be noisy. A user copying an AI-generated response doesn't always mean the response was good — maybe they were going to edit it heavily.

The practical approach: instrument both. Use implicit signals for volume and trend detection — they'll tell you when something's changing. Use explicit signals for ground truth calibration — they'll tell you what is changing and whether users are actually satisfied.

The minimum viable feedback loop doesn't require thumbs up/down. Start with: what gets regenerated? What gets deleted? What gets edited before use? If you can log those behaviors, you have a feedback signal. (In environments with strict privacy constraints or limited user interaction data, even aggregated usage patterns — error rates, task completion rates — can serve as a starting signal.)
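As a sketch of what that instrumentation can look like — the event names, file-based storage, and helper functions here are illustrative assumptions, not any product's actual API:

```python
import json
import time

# Hypothetical implicit-feedback events; pick whatever your UI can observe.
IMPLICIT_EVENTS = {"regenerate", "delete", "edit_before_use", "copy"}

def log_feedback(event: str, response_id: str, path: str = "feedback.jsonl") -> None:
    """Append one implicit-feedback event as a JSON line."""
    if event not in IMPLICIT_EVENTS:
        raise ValueError(f"unknown event: {event}")
    record = {"ts": time.time(), "event": event, "response_id": response_id}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def regeneration_rate(path: str = "feedback.jsonl") -> float:
    """Fraction of logged events that are regenerations -- a cheap trend signal."""
    with open(path) as f:
        events = [json.loads(line)["event"] for line in f]
    return events.count("regenerate") / len(events) if events else 0.0
```

A few lines like this, shipped at launch, are what make everything later in this post possible.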

The deeper point here is about timing. The data flywheel — better model → better experience → more usage → more signal → better model — only starts once feedback is instrumented. Teams that skip this at launch miss the early user behavior data that's most valuable for understanding where the system fails. Retroactively reconstructing it is rarely feasible — the patterns that matter most are tied to the specific context of early adoption.

Drift: Knowing When Something Changed

Not all drift requires the same response. That's worth dwelling on, because most teams treat all performance changes as the same kind of problem.

Building on Chip Huyen's work on data distribution shifts, it helps to distinguish three types:

Covariate shift — The distribution of inputs changes, but the relationship between inputs and correct outputs stays the same. Example: a model trained on users over 40 starts serving a younger demographic. The task is the same; the population changed. This may call for feature monitoring, but it doesn't always require retraining.

Label shift — The distribution of outcomes changes. A model trained when 80% of requests were type A now sees 50/50. The world changed what you're being asked to classify. Often signals a need to collect new representative data.

Concept drift — The same inputs should now produce different outputs, because the underlying relationship changed. The pricing example: before COVID, a 3-bedroom apartment in San Francisco priced at $2M; during COVID, the same apartment should price lower. The features are the same; the right answer changed. This is the one that typically requires retraining.

The reason the distinction matters: treating covariate shift like concept drift leads to unnecessary, expensive retraining. Missing concept drift leads to outputs that are increasingly wrong in ways that are hard to explain.

Detection approaches that work in practice:

  • Statistical: compare summary statistics (mean, median, variance) between training data and current production data

  • Behavioral: track accuracy-related metrics, eval pass rates, human rating trends over time

  • For LLMs specifically: regression test against a fixed eval set after any upstream model version change — the model behind the API can change even if your code doesn't, which is a form of concept drift introduced by your vendor

One caution from the research: "feature distributions shift all the time, and most of these changes are benign." Alert thresholds need calibration to avoid the alert fatigue that makes monitoring teams ignore everything.
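A minimal version of the statistical approach, assuming you log a numeric feature at training time and in production — the z-score threshold is a placeholder you'd calibrate per feature to avoid exactly that alert fatigue:

```python
from statistics import mean, stdev

def drift_alert(train_values: list[float],
                prod_values: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag when the production mean of a feature has moved more than
    z_threshold training standard deviations from the training mean.
    Crude but cheap; one check per monitored feature."""
    mu, sigma = mean(train_values), stdev(train_values)
    shift = abs(mean(prod_values) - mu) / sigma
    return shift > z_threshold

# Example: a feature centered near 40 in training, near 25 in production,
# mirrors the over-40-users-to-younger-demographic covariate shift above.
train = [38, 41, 40, 39, 42, 40, 41, 39]
prod = [26, 24, 25, 27, 25]
```

Note what this does and doesn't tell you: it flags that the input distribution moved (covariate shift), not whether the input-output relationship changed (concept drift) — that still needs behavioral metrics.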

Iterate: Fix Before You Retrain

When performance metrics drop, the instinct is often to reach for retraining. That's usually the wrong first move.

The way I'd think through it — a decision ladder, from lowest cost to highest:

1. Is it a prompt problem?
Try prompt tuning first. Adjust instructions, add examples, tighten constraints. Prompting doesn't require labeled data and is far less resource-intensive than fine-tuning. Research on LLM best practices highlights prompt-improvement strategies as a meaningful lever before fine-tuning — and in practice, prompt optimization can be surprisingly competitive for behavioral issues like format, tone, or focus.

2. Is your eval set stale?
Sometimes apparent performance drops are artifacts of an eval set that no longer reflects real usage. A practical heuristic, inspired by Hamel Husain's emphasis on eval-driven development: keep reading logs until you feel like you aren't learning anything new — then add those failure modes to your test cases. Updating the eval set is itself a form of learning.

3. Is it a capability gap at scale?
Fine-tune when context window limitations prevent you from giving the model enough examples, when domain-specific accuracy requirements exceed what prompting can achieve, or when you need consistent structured output at high volume.

4. Has the relationship fundamentally changed?
Retrain when you've identified concept drift and iteration hasn't closed the gap.

When performance drops: exhaust lower-cost options before reaching for retraining.
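One way to make the ladder concrete is as an ordered checklist in code — each predicate below is a stand-in for whatever diagnostic your team actually runs, not a real measurement:

```python
def next_move(prompt_fix_worked: bool,
              eval_set_reflects_usage: bool,
              prompting_hits_capability_ceiling: bool,
              concept_drift_confirmed: bool) -> str:
    """Walk the iterate-or-retrain ladder from cheapest to most expensive.
    Order is the point: never reach a lower rung's answer from a higher rung."""
    if prompt_fix_worked:
        return "ship prompt change"
    if not eval_set_reflects_usage:
        return "refresh eval set, then re-measure"
    if prompting_hits_capability_ceiling:
        return "fine-tune"
    if concept_drift_confirmed:
        return "retrain"
    return "keep investigating before spending on training"
```

The value of writing it down this way is that "retrain" becomes the answer you arrive at by elimination, not the reflex.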

The eval loop is infrastructure, not a one-time exercise. Three practical levels for teams at different stages:

  • Unit tests (fast, cheap assertions that run on every deploy)

  • Human and model eval on a regular cadence (sample recent outputs, inspect for patterns)

  • A/B testing once the product is mature enough to run live experiments

For teams without dedicated ML infrastructure, spreadsheets and lightweight tools like Streamlit or Gradio are legitimate starting points. "Keep it simple. Don't buy fancy LLM tools. Use what you have first" — that's Hamel Husain's guidance, and I'd echo it. The value is in the discipline of reviewing outputs regularly, not in the sophistication of the tooling.
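At the unit-test level, "eval" can literally mean assertions on output shape and constraints. A sketch — the specific checks and the length budget are made-up examples, not recommendations:

```python
def check_output(text: str) -> list[str]:
    """Cheap deploy-time assertions on one model output.
    Returns the list of failed checks (empty list means pass)."""
    failures = []
    if not text.strip():
        failures.append("empty response")
    if len(text) > 2000:
        failures.append("over length budget")
    if "as an ai language model" in text.lower():
        failures.append("boilerplate disclaimer leaked")
    return failures

def eval_pass_rate(outputs: list[str]) -> float:
    """Share of outputs with no failed checks -- trend this number over time."""
    passed = sum(1 for o in outputs if not check_output(o))
    return passed / len(outputs)
```

Run this over a fixed eval set on every deploy and you have the first level of the loop, no tooling purchase required.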

Retrain: When Iteration Isn't Enough

Most teams update models on gut feeling or arbitrary schedules. That's the specific problem the Learn layer addresses.

Better triggers:

  • Performance metric drops below a defined threshold on your eval set

  • Drift detection flags concept drift (not just covariate shift)

  • New data volume has accumulated and eval scores suggest it would help

The order matters: trigger on evidence, not schedules. "It's been three months" is not a trigger. "Our eval pass rate dropped from 87% to 71% over six weeks and prompt tuning didn't recover it" is. What counts as sufficient evidence will vary by system — a customer-facing classifier needs tighter thresholds than an internal summarizer. The principle is the same: define the threshold before you need it, so the decision isn't made under pressure.
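A sketch of what "define the threshold before you need it" can look like — the numbers are illustrative, not recommendations, and a real system would track this per metric:

```python
RETRAIN_TRIGGERS = {
    # Illustrative thresholds; calibrate these per system, in advance.
    "eval_pass_rate_floor": 0.80,  # retrain-candidate if we fall below this...
    "min_weeks_of_decline": 4,     # ...and stay below it this many weeks
}

def should_consider_retrain(weekly_pass_rates: list[float]) -> bool:
    """Evidence-based trigger: pass rate below the floor for N straight weeks.
    'It's been three months' never fires this; a sustained drop does."""
    n = RETRAIN_TRIGGERS["min_weeks_of_decline"]
    recent = weekly_pass_rates[-n:]
    floor = RETRAIN_TRIGGERS["eval_pass_rate_floor"]
    return len(recent) == n and all(r < floor for r in recent)
```

The six-week 87%-to-71% decline described above would fire this trigger; a calendar date never would.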

Before shipping a retrained model, regression test it: the new model should outperform the current one on your eval set, specifically on the failure modes you collected. If it improves the target cases but breaks things that used to work, you have a tradeoff to evaluate — not an automatic deploy. Plan for rollback: keep the prior model version accessible so you can revert if production behavior diverges from eval results.
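That regression gate can be expressed as a simple comparison over per-case eval scores — the scoring scale and the all-or-nothing policy here are assumptions you'd adapt to your own evals:

```python
def regression_gate(old_scores: dict[str, float],
                    new_scores: dict[str, float],
                    target_cases: set[str]) -> tuple[bool, list[str]]:
    """Decide whether a retrained model is safe to ship.
    old_scores / new_scores map eval-case id -> score in [0, 1];
    target_cases are the failure modes the retrain was meant to fix.
    Returns (ship?, list of regressed cases to evaluate as tradeoffs)."""
    regressions = [case for case in old_scores
                   if new_scores[case] < old_scores[case]]
    targets_improved = all(new_scores[c] > old_scores[c] for c in target_cases)
    return targets_improved and not regressions, regressions
```

When the gate returns regressions alongside improved targets, that's the tradeoff-to-evaluate case from above — a human decision, not an automatic deploy.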

Drawing from Chip Huyen's work on real-time ML, retraining exists on a spectrum:

  1. Manual retraining (ad-hoc, gut-feeling triggered)

  2. Automated batch retraining (scheduled, stateless)

  3. Stateful fine-tuning (incremental updates on new data)

  4. Continual learning (triggered by performance metrics or drift detection)

Level 4 is sophisticated infrastructure. Level 1 with well-defined triggers is better than Level 4 with bad ones. The Grubhub case study she cites — stateful daily retraining reducing costs 45x compared to stateless approaches — is a compelling end state. But most teams should focus on getting the triggers right before optimizing the cadence. In regulated industries (healthcare, finance), each level may require additional validation and compliance review — the framework still applies, but the overhead per step is higher.

How to Start

You don't need MLOps infrastructure to begin. Even if you're a two-person startup, feedback instrumentation can be as simple as logging what users regenerate or delete — that's a few lines of code, not a platform investment. The minimum viable Learn layer, in order of priority for most user-facing AI features:

  1. Feedback instrumentation first. Implicit signals — what gets regenerated, deleted, edited. You can't learn without signal.

  2. A fixed eval set second. Even 20–30 representative test cases give you a baseline. You can't measure improvement without one.

  3. Prompt iteration third. Fix what you can without touching infrastructure.

  4. Drift monitoring fourth. Once you have enough signal, add statistical monitoring over time.

  5. Retraining pipeline last. Only when you've exhausted the options above.

If you're building internal tooling where collecting usage signal is harder, starting with a solid eval set may be more practical than waiting for feedback instrumentation. The order is a guideline, not a requirement — but the principle holds: build feedback collection before building retraining infrastructure.

The sequence matters. Teams that build retraining pipelines before they have feedback instrumentation are optimizing the wrong layer.

The Learn layer connects backward to everything in this series. Strong build discipline and run discipline are necessary. A team that also closes the feedback loop — that treats the system as something that should improve over time, not just something that should stay stable — has a fundamentally different relationship with their production AI than one that doesn't.

And it connects forward: the next post will turn the iterate-or-retrain question into a decision tree — the same decision ladder above, made concrete and interactive.

Where Are You in the Learning Loop?

An AI system that isn't learning is already getting worse — just quietly. The grocery demand forecasting model didn't fail loudly. It drifted, overestimated, wasted inventory, and became unusable while everything in the logs looked fine.

The four components — Feedback → Drift → Iterate → Retrain — give you a vocabulary for what "the system learning" actually means. Not everything needs to happen at once. But feedback instrumentation needs to happen first.

If you shipped an AI feature recently: what signal are you collecting from production? What would tell you that the outputs are getting worse? If you don't have an answer to either question, feedback instrumentation is your next step.

Part of the Mental Models for Production AI series. Previous post: Silent Failures Make Me Most Nervous. Next: When Should I Retrain or Tune? (Decision Tree) — coming soon.
