Build Discipline vs Run Discipline

Two different mindsets required for AI features. What it takes to build well vs what it takes to run well—they're not the same.

This post is part of the Mental Models for Production AI series, which explores the mental frameworks needed to evaluate, build, operate, and improve AI-powered features—focusing on practical decision-making.

Building it right and running it right require different muscles. In the series overview, I
introduced the Execution Layer as "Build it and run it (two different disciplines)." This post is about what that actually means in practice—what each discipline contains, where teams tend to under-invest, and how deployment checklists force you to think about both before shipping.

Two Different Disciplines

Imagine this scenario: a team spends weeks on architecture. They design the prompt chain carefully. They build eval suites. They run shadow deployments. Then they ship—and within a month, something breaks silently. Nobody set up monitoring. There's no runbook. The on-call engineer doesn't know what "healthy" looks like for this feature.

Or the inverse: a team rushes to ship. "We'll figure out the monitoring later." The feature hits production and immediately works—until it doesn't. And then they're debugging a system they built fast but don't understand operationally.

Both patterns come from treating build and run as a single job. They're related, but the skills and tools are different.

Build Discipline: Architecture, Validation, Experimentation. Run Discipline: Monitoring, Operations, Recovery.

Build and Run are complementary disciplines. Skipping either one creates a different kind of debt.

Why This Distinction Matters

Google's MLOps framework describes their Level 0 maturity as a state where "data scientists who create the model and engineers who serve the model as a prediction service" are disconnected. The people who build it don't understand how it runs. The people who run it didn't shape how it was built.

That disconnect causes real problems. A seminal paper from Google (Sculley et al., "Hidden Technical Debt in Machine Learning Systems") demonstrated that ML code is a small fraction of a real-world ML system. The surrounding infrastructure—monitoring, data pipelines, configuration management, serving—dominates. If most of what matters in production is infrastructure, then running the system well deserves at least as much rigor as building it.

The paper also introduced the CACE principle: "Changing Anything Changes Everything." Change one input feature, and the weights and behavior of every other feature can shift. This means build decisions have run-time consequences that are hard to predict. The two disciplines are coupled—you can't do one well in isolation.

Closing this gap takes intentionality. Higher maturity levels in Google's framework progressively unify development and operations through pipeline automation and CI/CD. The key insight: unifying build and run doesn't mean one person does both—though on smaller teams, having the same person handle both can actually reduce silos and context loss. The point is that both disciplines get deliberate attention, whether that's one engineer wearing two hats or two teams with shared understanding.

Build Discipline

Build discipline is about the decisions and validation that happen before your AI feature serves a single real user. Three areas:

Architecture

How does the AI fit into your existing systems? For LLM features, this includes prompt design—the instructions that shape model behavior. It also includes vendor strategy: are you using an API or self-hosting? How locked in are you?

These decisions compound. A prompt architecture that works in development might not survive production-scale edge cases. A vendor choice that's cheap today might not scale. Architecture is where you make bets, and build discipline means making them deliberately.

Validation

Testing AI features goes beyond unit tests. You need eval suites that test the AI's behavior—does it handle adversarial inputs? Does it degrade gracefully on edge
cases?

A useful pattern: champion/challenger comparison. Before deploying a new model version, validate it against the current production model. Test for both sudden
degradation (a bug causing significantly lower quality) and slow degradation (performance drifting below a threshold).
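As a sketch, the champion/challenger gate can be a simple threshold check, assuming both models have already been scored on a shared eval suite. The function name and thresholds here are illustrative, not a standard API:

```python
# Hypothetical champion/challenger gate. Scores are average quality over a
# shared eval suite; thresholds are example values, not recommendations.

def challenger_passes(champion_score, challenger_score,
                      sudden_drop=0.10, quality_floor=0.70):
    """Gate a new model version before deployment.

    Rejects on slow degradation (absolute quality drifting below a fixed
    floor) and sudden degradation (a large relative drop vs the current
    production champion, e.g. from a bug in the new version).
    """
    if challenger_score < quality_floor:
        return False  # drifted below the acceptable threshold
    if challenger_score < champion_score * (1 - sudden_drop):
        return False  # significantly worse than production
    return True
```

In practice the two scores would come from running both models over the same eval suite; the gate itself can stay this simple.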

Security review matters here too. Prompt injection is a real attack surface for LLM features. Validation before deployment should include adversarial testing, fairness checks, and integration testing to verify the AI interfaces correctly with upstream and downstream systems.

Experimentation

The gap between "works in development" and "works in production" is where experimentation frameworks earn their keep.

Shadow deployment lets a new model process real production traffic without serving predictions to users. It's the first time you see how the model performs on actual data—without risk. Canary releases then gradually route a small percentage of real traffic to the new model, with daily analysis before increasing. A/B testing compares the new model against the current champion with real users and real stakes.

Shadow deployment roughly doubles your infrastructure costs while running, since it duplicates all traffic. Canary releases and A/B tests add less overhead since they split rather than duplicate traffic. All three add operational complexity—a real trade-off. For most AI features, it's worth it—especially early on when you're still learning what production behavior looks like.
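To make shadow deployment concrete: serve the champion's answer while submitting the same request to the challenger in the background, logging both for offline comparison. Everything here (`handle_request`, the logging scheme) is a hypothetical sketch, not a prescribed design:

```python
# Minimal shadow-deployment sketch: the champion serves the user; the
# challenger processes a duplicate of the request asynchronously and its
# output is only logged, never returned.

import concurrent.futures

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
shadow_log = []  # stand-in for a real metrics/logging pipeline

def handle_request(request, champion, challenger):
    served = champion(request)            # the user sees only this result
    def run_shadow():
        shadow = challenger(request)      # duplicated traffic, zero user risk
        shadow_log.append({"request": request,
                           "served": served,
                           "shadow": shadow})
    executor.submit(run_shadow)
    return served
```

The duplicated call is where the doubled infrastructure cost comes from: every request runs through both models while the shadow is active.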

Run Discipline

Run discipline is about what happens after you ship. Three areas here too:

Monitoring

I think about monitoring in three layers, each answering a different question:

  • Monitoring tracks real-time metrics—latency, token usage, error rates, cost. It tells you: is the system healthy right now?

  • Evaluation measures quality—accuracy, hallucination rates, relevance scores. It tells you: are the outputs good?

  • Observability provides full request tracing. It tells you: when something goes wrong, why?

For LLM features specifically, there are gaps that traditional monitoring misses. Prompt-completion linkage (tracing a response back to its prompt context), embedding drift (changes in vector representations that signal semantic shifts), and cost tracking all require purpose-built tooling.

On cost: industry benchmarks suggest that unoptimized LLM applications can spend several times more than necessary on inference. Token-based pricing creates variable
costs that scale with usage in ways that traditional software doesn't. Monitoring cost per request, per user, and per feature is often where you start.
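A minimal sketch of that starting point—per-request cost rolled up by user and by feature—with made-up example rates (real pricing varies by vendor and model):

```python
# Hypothetical per-request cost tracking. PRICE_PER_1K uses invented example
# rates in USD per 1,000 tokens; substitute your vendor's actual pricing.

from collections import defaultdict

PRICE_PER_1K = {"input": 0.003, "output": 0.006}  # assumed example rates

cost_by_user = defaultdict(float)
cost_by_feature = defaultdict(float)

def record_request(user_id, feature, input_tokens, output_tokens):
    """Compute one request's cost and roll it up by user and feature."""
    cost = (input_tokens / 1000) * PRICE_PER_1K["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K["output"]
    cost_by_user[user_id] += cost
    cost_by_feature[feature] += cost
    return cost
```

Even this much is enough to answer "which feature is burning the budget?"—the question token-based pricing forces on you.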

A practical note: if you're already using a monitoring platform like Datadog or New Relic, start there. Most have added LLM observability features. Consolidation tends to beat adding a new specialized vendor—the value is in having everything in one place, not in having the best-in-class tool for each concern.

Operations

AI incidents look different from traditional software incidents. The Coalition for Secure AI puts it well: failures look like behavior, not stack traces. The system is up. Latency is fine. But the outputs are wrong—and wrong in ways that are hard to detect automatically.

Containment options for AI failures include model rollback, traffic reduction, feature flags, circuit breakers, and in extreme cases, full shutdown. A useful principle from incident response frameworks: prioritize recovering to a safe state over recovering to a fast state. Get the system safe first, then optimize.
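A circuit breaker, for instance, can be sketched in a few lines: after repeated failures it opens and routes calls to a safe fallback instead of retrying the failing AI path. The class and threshold are illustrative, assuming your AI call and fallback are plain callables:

```python
# Hypothetical circuit breaker for an AI dependency. Opens after
# max_failures consecutive errors; while open, every call goes straight
# to the fallback—safe state first, fast state later.

class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, ai_fn, fallback_fn, request):
        if self.open:
            return fallback_fn(request)   # contained: skip the AI path
        try:
            result = ai_fn(request)
            self.failures = 0             # a healthy call resets the count
            return result
        except Exception:
            self.failures += 1
            return fallback_fn(request)
```

A production version would also add a cool-down that half-opens the breaker to probe for recovery; that is omitted here for brevity.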

Operations also means defining who reviews AI outputs and when. Human-in-the-loop decisions depend on the cost of wrong answers, the reversibility of decisions, and whether you can filter to high-risk cases. This is a topic we'll explore further in a later post.

Post-incident reviews should happen within a week or two while details are fresh. Blame-free, focused on process improvement—the same as traditional incident
retrospectives, applied to AI-specific failure modes.

Recovery

When things go wrong, how fast can you get back to a good state?

Automated rollback rules that compare live metrics—quality scores, latency, error rates, cost per request—against the previous model version are the foundation. If those metrics drop below a threshold, the system reverts without waiting for a human. For LLM features, semantic fallbacks add another layer: alternative prompt formulations or validation-first retries when an output doesn't meet quality requirements.
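A rollback rule of this kind can be a plain threshold comparison against the previous version's baseline; the metric names and limits below are placeholders:

```python
# Hypothetical automated rollback rule: revert to the previous model when
# any live metric crosses its threshold relative to that model's baseline.

def should_roll_back(live, baseline,
                     max_quality_drop=0.05,      # absolute score drop
                     max_latency_increase=1.5,   # multiplier on p95 latency
                     max_cost_increase=2.0):     # multiplier on cost/request
    if live["quality"] < baseline["quality"] - max_quality_drop:
        return True
    if live["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_increase:
        return True
    if live["cost_per_request"] > baseline["cost_per_request"] * max_cost_increase:
        return True
    return False
```

Wired into a deploy pipeline, a `True` here triggers the revert without waiting for a human—the point of the automation.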

Graceful degradation matters too. What happens when the AI service is completely unavailable? For features that augment an existing experience, the system should fall
back to a non-AI version rather than showing an error. For features where AI is the core experience—like a chatbot—the fallback might be a queued response or a maintenance message. Either way, the question is worth asking before you ship.
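For the augmenting case, graceful degradation can be as small as a try/except around the AI call—a sketch with hypothetical names:

```python
# Hypothetical augmenting feature: an article page with an optional AI
# summary. If the summarizer fails, serve the page without it rather than
# showing an error.

def render_article(article, summarizer):
    try:
        summary = summarizer(article)
    except Exception:
        summary = None      # degrade: omit the AI summary, keep the page
    return {"body": article, "summary": summary}
```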

After any recovery, validate the fix: re-run the scenarios that triggered the incident, test boundary conditions, and confirm monitoring alerts fire correctly. Recovery without validation is just hoping the problem doesn't recur.

The Bridge: Deployment Checklists

If build discipline and run discipline are two sides of the same coin, the deployment checklist is the edge that connects them. It's the handoff point—a structured forcing function that ensures both disciplines are represented before you ship.

Pre-Ship

This is where the three foundational questions from the series overview become concrete:


Three checklists that bridge build and run. I wouldn't want to ship an AI feature without working through these.

  • Q1 (Should we build this?), Q2 (How will it fail?), and Q3 (Can we afford it?) answered?

  • Failure modes documented?

  • Detection signals in place?

  • Fallback behavior defined?

  • Cost ceiling set?

  • Kill switch accessible?

  • Human escalation path exists?

  • Monitoring dashboard ready?

Each item touches both disciplines. "Failure modes documented" is build work. "Detection signals in place" is run work. "Fallback behavior defined" spans both—you
design it during build, but it executes during run.
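One way to make the checklist bite is to encode it as a gate in the deploy script. The item names mirror the list above; the structure is just one possible shape:

```python
# Hypothetical pre-ship gate: deployment is blocked until every checklist
# item is checked off. The False entry is an example of an outstanding item.

PRE_SHIP = {
    "failure_modes_documented": True,
    "detection_signals_in_place": True,
    "fallback_behavior_defined": True,
    "cost_ceiling_set": True,
    "kill_switch_accessible": False,   # example: still outstanding
    "human_escalation_path": True,
    "monitoring_dashboard_ready": True,
}

def ready_to_ship(checklist):
    """Return (ok, missing): whether to ship, and which items block it."""
    missing = [item for item, done in checklist.items() if not done]
    return (len(missing) == 0, missing)
```

The value isn't the code—it's that "we'll figure out the monitoring later" now shows up as a named, blocking item.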

Production Readiness

Health checks for AI features go beyond "is the service running." Is the model loaded and responsive? Are dependencies like vector databases and caching layers
available? Is the service performing within acceptable latency bounds?
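A health check along those lines might look like this sketch, where the probe functions stand in for real model and dependency checks:

```python
# Hypothetical AI-feature health check: verifies the model answers a tiny
# canary prompt within a latency bound and that each dependency responds.

import time

def check_health(model_probe, dependency_probes, max_latency_s=2.0):
    start = time.monotonic()
    try:
        response = model_probe("ping")        # tiny canary request
    except Exception:
        return {"healthy": False, "reason": "model unresponsive"}
    elapsed = time.monotonic() - start
    if not response:
        return {"healthy": False, "reason": "empty model response"}
    if elapsed > max_latency_s:
        return {"healthy": False, "reason": "latency above bound"}
    for name, probe in dependency_probes.items():
        if not probe():                       # e.g. vector DB, cache layer
            return {"healthy": False, "reason": f"{name} unavailable"}
    return {"healthy": True, "reason": "ok"}
```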

Data pipeline validation confirms data is flowing correctly and that environments are consistent between development and production. Runbooks for common failure
scenarios round this out—when an alert fires, what does the on-call engineer actually do?

Cost Control

For LLM features, cost control is a run-discipline concern that needs build-discipline decisions. Token budgets, per-request cost tracking, and budget alerts prevent
surprises. The build side sets the ceiling; the run side enforces it.
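A token budget with an alert threshold is one way to wire those two sides together; the limits and names below are illustrative:

```python
# Hypothetical daily token budget: the ceiling is a build-time decision,
# enforcement and alerting happen at run time.

class TokenBudget:
    def __init__(self, daily_limit, alert_fraction=0.8):
        self.daily_limit = daily_limit
        self.alert_fraction = alert_fraction
        self.used = 0

    def spend(self, tokens):
        """Return (allowed, alert): whether the request may proceed, and
        whether usage has crossed the alert threshold."""
        if self.used + tokens > self.daily_limit:
            return False, True        # over the ceiling: block the request
        self.used += tokens
        alert = self.used >= self.daily_limit * self.alert_fraction
        return True, alert
```

Whether a blocked request fails hard or falls back to a cheaper model is itself a build-time decision the checklist should surface.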

The Maturity Progression

Where does your team sit? Google's MLOps framework
describes three maturity levels (other frameworks like Microsoft's define more granular steps, but the progression is similar):

  • Level 0: Build and run are siloed. The team that builds the model throws it over the wall to the team that runs it. No shared tools, no shared understanding.

  • Level 1: The same pipeline works in development and production. Build and run share infrastructure. Automation replaces manual handoffs.

  • Level 2: Automated testing, validation, and deployment. Build and run are unified through CI/CD. Changes flow through both disciplines automatically.

The gap between "we built a model" and "we operate a production AI feature" is where a lot of value gets lost. The goal isn't reaching Level 2 overnight—and these
maturity levels are a reference point, not a prescription. A three-person startup will move through them differently than a platform team at a large company. What
matters is being intentional about both disciplines: knowing where you're investing and where you're taking on risk by deferring.

What's Next

This post named the two disciplines that make up the Execution Layer. The next posts explore what happens when they're missing:

The AI Failure Cascade maps the path from user request to user impact, showing where things can break at every transition. It's what happens when run discipline
has gaps.

Systems That Don't Learn, Decay closes the loop—what happens when neither build nor run includes a mechanism for learning and improvement over time.

Both build from the framework here. If you know what build discipline and run discipline look like, you're better equipped to see where failures originate and where
learning should feed back.
