Why Content Generation Is the Wrong Goal for Technical Education
Agentic content systems for technical education should measure learning outcomes—and that requires designing evaluation before content.

In my experience building technical workshops for enterprise clients, I learned that the content that "sounds good" often fails in the room. Learners get stuck on step 3. Step 3 is correct. The problem is step 2, which assumed knowledge they didn't have.
An agent that generates technically correct content will make this mistake at scale. A system designed around assessments catches it.
This post is part of the System Design Notes: Agentic Content Platforms for Technical Education series. It makes the case for evaluation-first design in agentic education platforms.
The Promise and the Pitfalls
I've been involved in technical education long enough to recognize both the promise and the pitfalls of agentic AI in this space.
My perspective comes from both sides: 500+ students taught across bootcamps, enterprise workshops, and online courses, plus over a decade in software development building reliable and high-traffic consumer products. The engineering side is where I learned to think in failure modes and systems rather than features.
The promise is real. I've watched learners go from confused to capable in a single workshop. Agents that handle the repetitive parts of curriculum development—drafting exercises, checking code, adapting difficulty—could free educators to focus on the parts that require human judgment. That's worth building toward.
The pitfall is equally real. I've seen what happens when you optimize for the wrong thing. Content that looks polished but assumes the wrong prerequisites. Exercises that compile but teach the wrong mental model. Scale makes these mistakes worse, not better. Agents amplify whatever goal you give them—and if the goal is "generate content," the failure modes scale with the output.
This series exists to build toward the promise by designing around the pitfalls. The architecture starts with evaluation, not generation.
The Trap
The tempting first move is always content generation. I know because I've made it.
Prompt a model. Get a lesson script. Check that the code runs and the explanations are clear. You have output on day one. It feels productive—tangible artifacts, easy to demo, satisfying to stakeholders.
The metrics reinforce the trap.
- Lessons generated per hour.
- Topic coverage percentage.
- Grammar scores.
- Code compilation pass rate.
These numbers go up and to the right. Everyone feels good.
Here's the problem: none of these metrics measure whether anyone learned anything.
Generation quality is a proxy metric. When you optimize for the proxy instead of the outcome, you get a system that produces polished content that's harder to learn from than it looks. This is Goodhart's Law applied to education AI—"When a measure becomes a target, it ceases to be a good measure."
Consider what "quality" means in a generation-first pipeline:
# What a generation-first pipeline optimizes for
content_metrics = {
    "grammar_score": 0.98,         # Looks polished
    "code_compiles": True,         # Runs without errors
    "factual_accuracy": 0.95,      # Passes automated fact-check
    "topic_coverage": "complete",  # Every subtopic addressed
    "readability_score": 72,       # Flesch-Kincaid approved
}

# What it doesn't measure
learning_metrics = {
    "learner_can_do_the_thing": "unknown",                # The actual goal
    "prerequisite_assumptions_valid": "unchecked",        # The silent failure
    "first_attempt_assessment_pass_rate": "not tracked",  # The north star
}

Every metric in content_metrics can score perfectly while every metric in learning_metrics remains unmeasured. The pipeline ships. The learners struggle. The dashboard stays green.
Silent Failures
Hallucinations get the headlines. A model invents a library that doesn't exist. An explanation contains a factual error. These failures are visible—a reviewer can catch them, a linter can flag them, a learner can Google it and discover the mistake.
The failure mode that worries me most is quieter.
An agent generates a lesson on building a REST API with FastAPI. The lesson is technically correct. The code compiles. The explanations are clear.
But it assumes the learner already understands HTTP methods, JSON serialization, and Python type hints. A learner who doesn't know these gets stuck on step 3. They can't articulate why—they don't know what they don't know. They close the tab.
The content passes every automated quality check. The learner fails anyway. The system has no signal that anything went wrong.
Research on prerequisite knowledge gaps consistently shows that students suffer from "unknown unknowns"—they can't identify the knowledge they're missing, and the cognitive load of trying to formulate questions about concepts they don't understand prevents them from seeking help.
This is what I mean by a silent failure: technically correct content that assumes the wrong prerequisites. It's invisible to every quality gate except actual assessment of the learner.
And it compounds. A content pipeline generating 10,000 lessons carries the same prerequisite assumption across all of them. A human instructor adapts in real time—reads the room, backs up, re-explains. An agent pipeline doesn't. The same blind spot gets replicated at scale.

A single unchecked prerequisite assumption propagates through a content pipeline. Each downstream lesson inherits the gap. Thousands of learners hit the same invisible wall.
Even the best-performing LLMs still hallucinate at low but non-trivial rates. In a pipeline producing thousands of exercises, even low single-digit error rates mean hundreds of incorrect items shipping to learners. And those are the detectable errors. The prerequisite gaps aren't detectable at all without assessing the learner directly.
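To make that concrete, here is a minimal sketch of the two kinds of gates. The names (Lesson, passes_content_gates, the prerequisite strings) are hypothetical and the checks are stubbed; the structural point is that everything on the content side can pass while the only check capable of catching the gap has to look at the learner, not the lesson.

from dataclasses import dataclass, field

@dataclass
class Lesson:
    title: str
    body: str
    # The assumptions that content-side checks never see.
    assumed_prerequisites: list[str] = field(default_factory=list)

def passes_content_gates(lesson: Lesson) -> bool:
    # Stand-in for grammar, compilation, and fact checks: every one of them
    # inspects the lesson text, none of them inspects the learner.
    return bool(lesson.body.strip())

def passes_prerequisite_gate(lesson: Lesson, known_concepts: set[str]) -> bool:
    # The only gate that can catch the silent failure: compare the lesson's
    # assumed prerequisites against what a diagnostic assessment says the
    # learner actually knows.
    missing = [c for c in lesson.assumed_prerequisites if c not in known_concepts]
    return not missing

fastapi_lesson = Lesson(
    title="Build a REST API with FastAPI",
    body="...technically correct content...",
    assumed_prerequisites=["http_methods", "json_serialization", "type_hints"],
)

# A learner who has never seen type hints: content gates pass, learner gate fails.
learner_knows = {"http_methods", "json_serialization"}
print(passes_content_gates(fastapi_lesson))                     # True
print(passes_prerequisite_gate(fastapi_lesson, learner_knows))  # False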
The Curriculum Design Inversion
There's a framework for this problem. It's been around since 1998.
Understanding by Design, by Wiggins and McTighe, introduced what curriculum designers call backward design. The core idea: start from what learners should be able to do, then work backward to figure out what to teach them. Three stages:
1. Identify Desired Results — What should learners know, understand, and be able to do?
2. Determine Acceptable Evidence — What assessments prove the learning happened?
3. Plan Learning Experiences — Only now design the content.
Wiggins and McTighe also identified what they called the "twin sins" of traditional curriculum design:
- Activity-oriented design: "Hands-on without being minds-on"—engaging experiences that lead only accidentally to insight or achievement.
- Coverage-oriented design: Marching through content without verifying understanding.
Both sins map directly to AI content generation. An agent producing technically correct lessons that march through a topic is committing the coverage sin at scale. An agent generating interactive exercises without alignment to learning outcomes is committing the activity sin.
The inversion applied to agent architecture looks like this:
Generation-first pipeline:
Topic → Content Agent → Assessment Agent → Ship
Inverted pipeline:
Learning Objectives → Assessment Design → Content Agent (constrained by assessments) → Ship

Left: the generation-first pipeline generates content first, then builds assessments around it. Right: the inverted pipeline defines assessments first, then generates content constrained by those assessments. The constraint flow changes everything.
The difference matters for agents specifically. An agent generating content with no constraints is an open-ended generation problem—hard to evaluate, easy to hallucinate, prone to prerequisite assumptions. An agent generating content to satisfy a specific assessment has a tighter, more verifiable goal. Constraints make agents more reliable, not less.
# What an evaluation-first pipeline optimizes for
pipeline_goal = {
    "north_star": "first_attempt_assessment_pass_rate",
    "content_constraint": "learner passes assessment without external help",
    "prerequisite_check": "assessment verifies prerequisites before lesson begins",
    "feedback_signal": "assessment failure triggers content revision",
}

When assessments come first, content agents have verifiable constraints. When content comes first, assessments tend to verify what was taught rather than what was learned.
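As a rough sketch of that dependency order (all names are hypothetical, and both the model call and the pilot run are stubbed out), the content agent never receives a bare topic; it receives the assessment it has to satisfy, and a failed pilot assessment blocks shipping:

from dataclasses import dataclass
import random

@dataclass
class LearningObjective:
    statement: str             # e.g. "Learner can build and test a FastAPI endpoint"
    verified_prerequisites: list[str]

@dataclass
class Assessment:
    objective: LearningObjective
    tasks: list[str]           # the evidence that learning happened

def generate_lesson(assessment: Assessment) -> str:
    # Placeholder for a content-agent call. The important part is the input:
    # the agent gets the assessment as a constraint, not just a topic string.
    return (
        f"Lesson targeting: {assessment.objective.statement}. "
        f"Assumes only: {assessment.objective.verified_prerequisites}. "
        f"Prepares learner for: {assessment.tasks}."
    )

def pilot_pass_rate(lesson: str, assessment: Assessment) -> float:
    # Placeholder for running the assessment with a pilot cohort (or simulated
    # learners) and measuring first-attempt passes.
    return random.uniform(0.5, 1.0)

def build(assessment: Assessment, threshold: float = 0.8, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        lesson = generate_lesson(assessment)
        if pilot_pass_rate(lesson, assessment) >= threshold:
            return lesson      # "done" means learners pass, not "content exists"
        # Assessment failure triggers revision rather than shipping.
    raise RuntimeError("Escalate to a human author: the constraint is not being met")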
What Scale Without Evaluation Looks Like
Two real-world examples show what happens when content generation scales faster than learning evaluation.
Duolingo: When Generation Outpaces Evaluation
Duolingo has been transparent about their architecture: a three-component pipeline of Generation, Evaluation, and Selection. The generation step produces multiple exercise variants. Evaluators check grammatical correctness, logical coherence, and pedagogical difficulty fit. Only variants that pass all evaluators move to selection.
Duolingo’s scale results were impressive—expansion from 300 to 15,000+ audio episodes across 25+ courses in under six months, with massive cost reduction. This is the promise of agentic content generation, and Duolingo built a more sophisticated pipeline than most.
But generation scaled faster than evaluation could keep up. As one report documented, their automated system shipped an entire French course missing the words for "and" and "or"—a gap no evaluator caught because nobody was checking for it. As AI-generated lessons scaled, observers noted that "the culturally textured flavor of Duolingo's exercises began fading" and "some exercises felt flat or culturally awkward." The evaluators might have caught grammar and coherence but not cultural nuance.
This is instructive for this post's argument. Duolingo built real evaluators—not a rubber stamp, but a genuine evaluation-gated pipeline. And it still had blind spots. Its evaluators targeted content properties: grammar, coherence, difficulty. What they did not target were the silent failures described above: prerequisite gaps, cultural context, whether the learner actually retained the material. Even a sophisticated evaluation-gated pipeline has blind spots when evaluation targets content quality rather than learner outcomes.
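The shape of that pipeline is worth pinning down, because the blind spot is structural. Below is a generic sketch of a generate/evaluate/select loop of the kind described above; it is not Duolingo's implementation, and every name is invented. Each evaluator takes only the candidate content as input, so nothing in this loop can fail on a learner-outcome problem:

from typing import Callable

# Each evaluator inspects only the candidate content.
Evaluator = Callable[[str], bool]

def looks_grammatical(text: str) -> bool: return True   # placeholder checks
def is_coherent(text: str) -> bool: return True
def fits_difficulty(text: str) -> bool: return True

evaluators: list[Evaluator] = [looks_grammatical, is_coherent, fits_difficulty]

def generate_variants(prompt: str, n: int = 5) -> list[str]:
    # Placeholder for n sampled generations from a model.
    return [f"{prompt} (variant {i})" for i in range(n)]

def select(prompt: str) -> str | None:
    candidates = generate_variants(prompt)
    survivors = [c for c in candidates if all(e(c) for e in evaluators)]
    # Selection can only rank what the evaluators can see. Nothing in this loop
    # takes a learner as input, so prerequisite gaps, cultural nuance, and
    # retention are invisible to it by construction.
    return survivors[0] if survivors else None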
Khan Academy: When Content Quality Isn't the Bottleneck
Khan Academy's efficacy data shows that in a study of roughly 350,000 students, recommended dosage (30+ min/week or 18+ hours/year) was associated with 20–30% higher-than-expected learning gains, with a longitudinal subset of around 221,000 students tracked over multiple years.
When researchers independently studied Khanmigo, the AI tutor specifically, they found no statistically significant difference between Khanmigo and Google Search for learning outcomes. Students perceived Khanmigo positively and appreciated its step-by-step guidance, but the measured learning gains were equivalent.
The bottleneck wasn't content quality. It was learner engagement with practice and assessment. As EdWeek reported, when students respond with "I don't know" or minimal effort, there's no reason to expect they'll learn—regardless of how good the content is.
Both platforms demonstrate the same lesson: evaluating content isn't the same as evaluating learning. Duolingo's evaluators ensure content quality; Khan Academy's data shows quality content doesn't guarantee outcomes. The gap between "good content" and "learner actually learned" is where evaluation-first design lives.
What About...
Two objections come up whenever I present the argument that evaluation should target learner outcomes rather than content quality.
"This is just waterfall thinking applied to AI." That's a fair concern, but the dependency order isn't the same as a linear process. Each stage—objectives, assessments, content—is iterative internally. You revise assessments as you learn what learners struggle with. You revise content when assessment data shows it isn't working. What's not iterative is the dependency order: objectives constrain assessments, assessments constrain content. That's not waterfall. That's having a spec before you write the implementation.
"LLMs are getting good enough that evaluation-first is overkill." Better models do make generation easier. They also make evaluation harder. A hallucinated technical explanation from GPT-3.5 was easy to catch—it sounded off. A hallucinated explanation from a frontier model sounds right. It passes human review. It passes automated checks. The only thing that catches it is assessing whether the learner actually learned the correct concept. Better models are an argument for evaluation-first, not against it.
Three Questions Before Building
Before building any agentic content system, three questions shape the entire design. These parallel the three questions I use for any AI system, adapted for education specifically.
"Should we build this?"
Yes—but the goal matters.
If success is measured by "lessons generated per month," the metric won't tell you whether anyone learned. A better goal: learners who can demonstrate competency they didn't have before.
I believe that the North Star Metric for an agentic education platform should be First-Attempt Assessment Pass Rate (FAAPR). How often do learners pass an assessment on the first try, without external help? This single metric forces the entire pipeline to care about prerequisites, difficulty calibration, and content clarity—because if any of those fail, the pass rate drops.
I’m aware that FAAPR alone can reward teaching-to-the-test—optimizing for assessment passage without deeper understanding. That’s why complementary metrics like delayed retention, transfer to novel problems, and time-to-competency matter too. But FAAPR is the forcing function that makes the pipeline care about learner outcomes at all, which is a prerequisite for measuring anything else.
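For concreteness, here is one way to compute that metric from raw attempt logs. The record shape is an assumption rather than a real schema, and a production version would need a more honest signal for "external help" than a boolean:

from dataclasses import dataclass

@dataclass
class AttemptRecord:
    learner_id: str
    assessment_id: str
    attempt_number: int   # 1 = first attempt
    passed: bool
    used_external_help: bool = False

def first_attempt_pass_rate(attempts: list[AttemptRecord]) -> float:
    """First-Attempt Assessment Pass Rate over a batch of attempt logs."""
    firsts = [
        a for a in attempts
        if a.attempt_number == 1 and not a.used_external_help
    ]
    if not firsts:
        return 0.0
    return sum(a.passed for a in firsts) / len(firsts)

attempts = [
    AttemptRecord("lrn-1", "rest-api-basics", 1, passed=True),
    AttemptRecord("lrn-2", "rest-api-basics", 1, passed=False),
    AttemptRecord("lrn-2", "rest-api-basics", 2, passed=True),   # retry doesn't count
]
print(first_attempt_pass_rate(attempts))  # 0.5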
"How will it fail?"
The failure mode that worries me most: content that looks correct while learners don't learn. This passes every automated check—grammar, factual accuracy, code correctness. Only direct assessment of the learner detects it.
There are also the visible failures. Even the best-performing LLMs still hallucinate at low but non-trivial rates, and at pipeline scale, even small error percentages compound into hundreds or thousands of incorrect exercises. Package hallucinations—where code-generating models recommend libraries that don't exist—are a particular risk in technical education.
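Package hallucinations are at least a checkable failure. PyPI exposes a public JSON endpoint (https://pypi.org/pypi/<name>/json) that returns 404 for packages that don't exist, so a pipeline can verify every package an exercise depends on before it ships. A rough sketch; note that import names and distribution names don't always match, so a real check needs a mapping step:

import urllib.error
import urllib.request

def package_exists_on_pypi(name: str) -> bool:
    # PyPI's public JSON endpoint returns 404 for packages that don't exist.
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

def check_exercise_packages(packages: list[str]) -> list[str]:
    """Return the packages an exercise references that don't exist on PyPI,
    i.e. likely hallucinations that should block the exercise from shipping."""
    return [p for p in packages if not package_exists_on_pypi(p)]

# Example: "fastapi" exists; a hallucinated name would come back in the list.
print(check_exercise_packages(["fastapi", "totally-made-up-http-helper"]))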
"Can we afford it?"
Every LLM call, every evaluation run, every human-in-the-loop review has a price. That price compounds at scale.
GPT-3.5-class inference costs fell 280-fold between November 2022 and October 2024. Models are cheap. But the realistic prompt-to-usable-output ratio is roughly 10:1—you need multiple generation attempts, evaluation passes, and revisions per final piece of content. Real costs are an order of magnitude higher than per-token pricing suggests.
Without evaluation gates, you're paying to scale mistakes. With evaluation gates, you're paying more per unit of content, but each unit works better. Cost modeling threads through this entire series—every architectural decision has a price, and we'll address it honestly rather than hand-wave it.
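A back-of-the-envelope sketch of how the 10:1 ratio and evaluation passes change per-lesson cost. Every number here is an assumption to replace with your own figures, not a benchmark:

def cost_per_shipped_lesson(
    price_per_1k_tokens: float = 0.002,   # assumed blended input+output price
    tokens_per_generation: int = 4_000,
    generation_attempts: int = 10,        # the rough 10:1 prompt-to-usable ratio
    eval_runs_per_attempt: int = 3,       # automated evaluation passes
    tokens_per_eval: int = 1_500,
    human_review_cost: float = 15.0,      # assumed cost of one human-in-the-loop pass
) -> float:
    generation_tokens = generation_attempts * tokens_per_generation
    evaluation_tokens = generation_attempts * eval_runs_per_attempt * tokens_per_eval
    model_cost = (generation_tokens + evaluation_tokens) / 1_000 * price_per_1k_tokens
    return model_cost + human_review_cost

# One clean generation costs under a cent at these prices;
# the full pipeline lands around $15 per shipped lesson.
print(round(cost_per_shipped_lesson(), 2))  # 15.17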

The goal you set determines the system you build. A generation target produces a content factory. An outcome target produces a learning system.
What This Series Builds
This is a series about building a system that produces measurable learning outcomes—where evaluation is the core, not a quality gate layered on top.
What changes when you adopt this framing:
Evaluation is the primary system. Content agents exist to serve it, not the other way around. The evaluation pipeline runs whether or not new content is being generated, because it's measuring learner outcomes continuously.
Content agents receive constraints, not just topics. They get learning objectives and assessment criteria as input. "Generate a lesson on FastAPI" becomes "Generate a lesson that prepares learners to pass this specific assessment, given these verified prerequisites."
The North Star Metric is learner performance. First-Attempt Assessment Pass Rate, not content volume or quality scores. Complementary metrics like delayed retention, transfer to novel problems, and time-to-competency matter too.
Cost modeling includes evaluation. The evaluation pipeline is a first-class cost center, not overhead.
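In request terms (the shapes below are illustrative, not a real API), the difference between the two framings is what the content agent receives as input:

# Generation-first: the agent gets a topic and fills in everything else itself.
topic_only_request = {"topic": "Build a REST API with FastAPI"}

# Evaluation-first: the agent gets the constraints it must satisfy.
constrained_request = {
    "learning_objective": "Learner can build and test a FastAPI endpoint",
    "target_assessment_id": "rest-api-basics-v2",   # hypothetical ID
    "verified_prerequisites": ["http_methods", "json_serialization", "type_hints"],
    "constraint": "A learner who meets the prerequisites passes the assessment "
                  "on the first attempt without external help.",
}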
Here's where the series goes:
1. Defining learning objectives that constrain agent behavior
2. Assessment design as the evaluation backbone
3. Content generation agents, constrained by objectives and assessments
4. The evaluation pipeline
5. Human-in-the-loop and quality gates
6. Cost modeling and operational reality
Every architectural decision in this series flows from one principle: design the evaluation system first, then build content agents that satisfy it.
The Inversion as First Principle
The curriculum design inversion—learning objectives → assessments → content—has decades of academic backing and production evidence behind it. Backward design is well-established in education.
What's new is applying it to agent architecture. The inversion changes what agents optimize for, how pipelines are structured, and what "done" means. "Done" means learners pass assessments, not that content has been generated.