System Design Notes: Agentic Content Platforms for Technical Education — Series Overview

Designing an agentic content platform for technical education from the ground up — starting with evaluation, not generation

This series designs an agentic content platform for technical education from the ground up — starting with evaluation, not generation. Each post addresses one layer of the system: learning objectives, rubric evaluation, agent pipelines, orchestration, drift detection, teaching feedback, and interactive learning.

The Central Design Principle

The architecture starts from a single premise: the goal of an agentic content system for education is measurable learning outcomes, not content volume or generation quality. This is backward design, borrowed from curriculum theory (Wiggins and McTighe's Understanding by Design framework). The sequence is: define learning objectives, design assessments that measure those objectives, then build instruction that targets those assessments. Content generation comes last, constrained by everything before it.
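
As a minimal sketch of how that ordering can be enforced in code (the type names here are hypothetical, not from the posts): instruction can only be constructed from an assessment, and an assessment only from an objective, so generation is constrained by everything upstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LearningObjective:
    """What the learner should be able to do after the lesson."""
    statement: str

@dataclass(frozen=True)
class Assessment:
    """How the objective is measured; built only from an objective."""
    objective: LearningObjective
    task: str

@dataclass(frozen=True)
class InstructionSpec:
    """What to teach; built only from an assessment, never directly."""
    assessment: Assessment
    outline: list[str]

def generation_brief(spec: InstructionSpec) -> str:
    """Content generation comes last, constrained by everything above it."""
    return (
        f"Objective: {spec.assessment.objective.statement}\n"
        f"Assessment task: {spec.assessment.task}\n"
        f"Outline: {', '.join(spec.outline)}"
    )
```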

The Narrative Arc

The seven posts in this series follow four phases that mirror how a production system is built.

Phase 1: Define the Goal (Post 1)

Why Content Generation Is the Wrong Goal for Technical Education argues the correct target is measurable learning outcomes, introduces backward design from curriculum theory, and establishes FAAPR as the metric every subsequent design decision optimizes for.

Phase 2: Build the System (Posts 2–4)

These three posts form the core content pipeline: define quality, generate content, run it reliably.

Evaluation-First Agent Architecture (Post 2) builds the rubric system. Five dimensions (Technical Correctness, Conceptual Clarity, Cognitive Load, Prerequisite Alignment, Code Executability), each with calibrated thresholds. The post walks through judge calibration using Cohen's kappa and HITL routing based on confidence bands.
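
A rough sketch of that structure, with placeholder thresholds and confidence bands rather than the calibrated values the post derives:

```python
from dataclasses import dataclass

# Placeholder thresholds -- the post derives calibrated values per dimension.
RUBRIC_THRESHOLDS = {
    "technical_correctness": 0.90,
    "conceptual_clarity": 0.80,
    "cognitive_load": 0.75,
    "prerequisite_alignment": 0.80,
    "code_executability": 1.00,
}

@dataclass
class JudgeResult:
    scores: dict[str, float]   # dimension -> score in [0, 1]
    confidence: float          # proxy for how much to trust the judge's verdict

def route(result: JudgeResult, low_band: float = 0.6, high_band: float = 0.85) -> str:
    """Route content based on rubric scores and judge confidence bands."""
    failing = [d for d, t in RUBRIC_THRESHOLDS.items() if result.scores.get(d, 0.0) < t]
    if result.confidence < low_band:
        return "human_review"          # judge too uncertain to trust either way
    if failing and result.confidence >= high_band:
        return "auto_reject"           # confident failure: send back to the pipeline
    if failing:
        return "human_review"          # borderline: a human breaks the tie
    return "publish"

# Example: one dimension below threshold with middling confidence goes to a human.
print(route(JudgeResult(
    scores={d: 1.0 for d in RUBRIC_THRESHOLDS} | {"cognitive_load": 0.6},
    confidence=0.7,
)))  # -> human_review
```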

The Multi-Agent Course Artifact Pipeline (Post 3) designs the six-agent pipeline. Each agent maps to a curriculum team role with constrained tools and isolated evaluation. The post covers why separation of concerns matters for debuggability, how shared state contracts prevent agents from overwriting each other, and why RAG typically outperforms fine-tuning for educational content where knowledge changes frequently and attribution matters.
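
One way to picture those shared-state contracts (the permission table below is an illustrative sketch, though the agent names follow the post): each agent declares the keys it may write, and the state object rejects anything outside that contract.

```python
# Illustrative shared-state contract: each agent may only write the keys it owns.
WRITE_PERMISSIONS = {
    "objective_interpreter": {"objectives"},
    "content_drafter":       {"draft"},
    "code_validator":        {"code_report"},
    "pedagogy_reviewer":     {"pedagogy_report"},
    "safety_checker":        {"safety_report"},
    "publishing_gate":       {"publish_decision"},
}

class SharedState:
    """Shared pipeline state with per-agent write permissions."""

    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    def read(self, key: str) -> object:
        return self._data[key]

    def write(self, agent: str, key: str, value: object) -> None:
        if key not in WRITE_PERMISSIONS.get(agent, set()):
            raise PermissionError(f"{agent} may not write {key!r}")
        if key in self._data:
            raise ValueError(f"{key!r} already written; agents never overwrite each other")
        self._data[key] = value

state = SharedState()
state.write("objective_interpreter", "objectives", ["Explain idempotency keys"])
state.write("content_drafter", "draft", "...lesson text...")
# state.write("content_drafter", "objectives", [])  # would raise PermissionError
```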

Stateful Orchestration for Reliable Course Production (Post 4) handles what happens when agents fail: event-driven coordination with idempotency keys, a failure escalation ladder (retry → auto-fix → dead-letter queue), content versioning through event sourcing, and cost modeling of the overhead that retries add. The post also covers build-vs-buy trade-offs for orchestration infrastructure.
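
A compressed sketch of that escalation ladder, with an idempotency key derived from the event payload so a redelivered event cannot run the same step twice; the retry count and handler names are illustrative.

```python
import hashlib
import json

MAX_RETRIES = 2
processed: set[str] = set()         # idempotency keys of completed steps
dead_letter_queue: list[dict] = []  # events that exhausted the ladder

def idempotency_key(event: dict) -> str:
    """Stable key from the event payload, so redelivery is a no-op."""
    return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

def handle(event: dict, run_step, auto_fix) -> None:
    key = idempotency_key(event)
    if key in processed:
        return                      # already handled; safe under redelivery
    for _ in range(MAX_RETRIES + 1):
        try:
            run_step(event)
            processed.add(key)
            return
        except Exception:
            continue                # rung 1: retry
    try:
        auto_fix(event)             # rung 2: attempt an automated repair
        run_step(event)
        processed.add(key)
    except Exception:
        dead_letter_queue.append(event)  # rung 3: park for human attention

steps_run: list[str] = []
handle({"lesson_id": "lesson-04", "step": "validate_code"},
       run_step=lambda e: steps_run.append(e["step"]),
       auto_fix=lambda e: None)
print(steps_run, dead_letter_queue)  # ['validate_code'] []
```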

Phase 3: Sustain the System (Posts 5–6)

Production systems degrade. These posts address what happens after you ship.

Curriculum Drift Detection (Post 5) identifies five distinct ways published content decays and designs detection systems for each. The post establishes maintenance cadence (daily through quarterly) and a three-tier prioritization system for remediation.
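
A sketch of how those signals might be routed; the signal names echo the list later in this overview, while the cadences and tier assignments are placeholders rather than the post's actual schedule.

```python
# Illustrative mapping from drift signal to check cadence and remediation tier.
DRIFT_POLICY = {
    # signal                      (cadence,      tier)  -- tier 1 is most urgent
    "code_sample_fails":          ("daily",       1),
    "dependency_outdated":        ("weekly",      2),
    "link_broken":                ("weekly",      3),
    "assessment_pass_rate_drops": ("per_cohort",  1),
    "high_learner_confusion":     ("per_cohort",  2),
}

def remediation_queue(signals: list[str]) -> list[str]:
    """Order detected signals by tier so the most urgent decay is fixed first."""
    known = [s for s in signals if s in DRIFT_POLICY]
    return sorted(known, key=lambda s: DRIFT_POLICY[s][1])

print(remediation_queue(["link_broken", "code_sample_fails", "high_learner_confusion"]))
# -> ['code_sample_fails', 'high_learner_confusion', 'link_broken']
```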

Teaching as System Observability (Post 6) adds the instructor as a quality gate that catches failures automated evaluation misses. Teaching is performance-oriented — ordered delivery forces the instructor through content sequentially, exposing ordering errors and assumed knowledge that document review overlooks. The post maps instructor feedback into the system as structured events and designs the rubric refinement loop where learner signals update evaluation criteria over time.
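
One plausible shape for those structured events (field names are illustrative): each annotation captured during delivery becomes a typed event the orchestration layer can consume like any other pipeline input.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class InstructorFeedbackEvent:
    """A single annotation captured while teaching, emitted as a pipeline event."""
    lesson_id: str
    kind: str            # e.g. "ordering_error", "assumed_knowledge", "pacing"
    note: str
    cohort_id: str
    observed_at: str

def emit(lesson_id: str, kind: str, note: str, cohort_id: str) -> str:
    event = InstructorFeedbackEvent(
        lesson_id=lesson_id,
        kind=kind,
        note=note,
        cohort_id=cohort_id,
        observed_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(event))  # ready for the event bus / content update queue

print(emit("lesson-04", "assumed_knowledge",
           "Assumes learners have already seen decorators", "cohort-12"))
```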

Phase 4: Extend to Learners (Post 7)

Adaptive Feedback Agents for Interactive Technical Learning (Post 7) moves from content production to learner interaction. Feedback agents provide graduated hints calibrated to error type (syntax errors escalate faster than logic errors), with guardrails against prompt injection and solution leakage. The cost model demonstrates that real-time adaptive feedback is economically viable. Human escalation triggers cover cases where agents reach their limits: repeated failures at the SOLUTION level, off-scope questions, or frustration signals without progress.
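
A sketch of that calibration; the ladder levels appear later in this overview, and the per-error-type step sizes and escalation thresholds here are illustrative, not the post's values.

```python
from enum import IntEnum

class Hint(IntEnum):
    NUDGE = 0
    HINT = 1
    GUIDED = 2
    SOLUTION = 3

# Illustrative calibration: syntax errors escalate faster than logic errors.
ESCALATION_STEP = {"syntax": 2, "logic": 1}

def next_hint(current: Hint, error_type: str) -> Hint:
    step = ESCALATION_STEP.get(error_type, 1)
    return Hint(min(current + step, Hint.SOLUTION))

def should_escalate_to_human(failures_at_solution: int,
                             off_scope: bool,
                             frustrated_no_progress: bool) -> bool:
    """Hand off when the agent has reached its limits."""
    return failures_at_solution >= 2 or off_scope or frustrated_no_progress

print(next_hint(Hint.NUDGE, "syntax").name)  # GUIDED -- syntax jumps two rungs
print(next_hint(Hint.NUDGE, "logic").name)   # HINT   -- logic climbs one rung
print(should_escalate_to_human(2, False, False))  # True
```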

The Full System Architecture

The diagram below shows how the seven posts in this series fit together as a system. Content flows forward (top to bottom) through generation. Quality signals flow backward (bottom to top) through evaluation refinement.

The key insight: this is a closed loop, not a linear pipeline. Learner interactions in Post 7 feed data back to drift detection in Post 5, which triggers updates through orchestration in Post 4, which re-runs the pipeline in Post 3 against evaluation criteria from Post 2.

Here's what each layer does:

The Goal (Post 1) defines learning objectives and the North Star Metric. Everything downstream serves this layer. The three foundational questions — should we build this? how will it fail? can we afford it? — scope every decision that follows.

Evaluation (Post 2) operationalizes quality through five rubric dimensions with calibrated thresholds. A three-layer evaluation stack (deterministic checks → LLM-as-judge → human review) routes content based on judge confidence, measured by Cohen's kappa. The system asks "will learners pass the assessment?" rather than "is this content good?"

Agent Pipeline (Post 3) mirrors a curriculum team's division of labor: six specialized agents (Objective Interpreter, Content Drafter, Code Validator, Pedagogy Reviewer, Safety Checker, Publishing Gate), each with constrained tools, isolated evaluation, and explicit read/write permissions on shared state.

Orchestration (Post 4) provides the reliability layer — event-driven state management with idempotency keys, failure isolation through a retry-then-dead-letter escalation ladder, content versioning via event sourcing, and observability across the full pipeline.

Maintenance (Post 5) catches content decay after publication through five drift signals: dependency outdated, code sample fails, link broken, high learner confusion, and assessment pass rate drops. A tiered prioritization system routes each signal to the appropriate remediation strategy.

Teaching Feedback (Post 6) adds the instructor as an observability layer. Teaching surfaces ordering errors, assumed knowledge, and pacing issues that automated evaluation and reading review miss. Instructor annotations flow back as content update events, and learner signals refine the rubric itself.

Interactive (Post 7) is the system's direct interface with learners. Feedback agents provide graduated hints (NUDGE → HINT → GUIDED → SOLUTION) with guardrails against prompt injection and solution leakage, plus a cost model showing real-time feedback is viable at roughly $1.50 per 100-learner cohort.
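
Back-of-the-envelope arithmetic shows how a figure in that neighborhood can arise; every number below (hints per learner, tokens per hint, price per token) is an illustrative placeholder, not the post's measured cost model.

```python
# Illustrative placeholders only -- the post's cost model uses its own measured values.
learners = 100
hints_per_learner = 6            # average hint requests per learner
tokens_per_hint = 1_000          # prompt + completion tokens per hint, combined
price_per_million_tokens = 2.50  # blended $/1M tokens for a small model

cohort_cost = (learners * hints_per_learner * tokens_per_hint
               * price_per_million_tokens / 1_000_000)
print(f"${cohort_cost:.2f} per {learners}-learner cohort")  # $1.50 per 100-learner cohort
```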

How the Components Connect

The diagram shows edges between layers, but three feedback loops deserve closer attention. They're what make this a system rather than a sequence.

Loop 1: Learner Signals → Drift Detection → Rubric Refinement

Post 6's teaching feedback generates quality signals — confusion patterns, prerequisite gaps, completion rate drops. These feed into Post 5's drift detection as leading indicators. When patterns persist across cohorts, they trigger rubric refinement in Post 2: updating thresholds, adding dimensions, or recalibrating judges. This loop means the system's definition of quality improves with each cohort of learners.
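
A sketch of that trigger condition, with a placeholder persistence window and adjustment size: the rubric only changes when the same signal shows up across several consecutive cohorts.

```python
from collections import Counter

def persistent_signals(cohort_signals: list[set[str]], min_cohorts: int = 3) -> set[str]:
    """Signals present in every one of the most recent `min_cohorts` cohorts."""
    counts = Counter(s for cohort in cohort_signals[-min_cohorts:] for s in cohort)
    return {signal for signal, n in counts.items() if n >= min_cohorts}

def refine_thresholds(thresholds: dict[str, float], signals: set[str]) -> dict[str, float]:
    """Tighten the related rubric dimension when a pattern persists (placeholder step size)."""
    updated = dict(thresholds)
    if "prerequisite_gap" in signals:
        updated["prerequisite_alignment"] = round(min(1.0, updated["prerequisite_alignment"] + 0.05), 2)
    if "high_confusion" in signals:
        updated["conceptual_clarity"] = round(min(1.0, updated["conceptual_clarity"] + 0.05), 2)
    return updated

history = [{"high_confusion"}, {"high_confusion", "prerequisite_gap"}, {"high_confusion"}]
print(persistent_signals(history))  # {'high_confusion'} -- persisted across three cohorts
print(refine_thresholds({"conceptual_clarity": 0.80, "prerequisite_alignment": 0.80},
                        persistent_signals(history)))
```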

Loop 2: Interactive Data → Maintenance → Pipeline

Post 7's feedback agents generate learner interaction data: hint escalation rates, common error patterns, time-to-resolution by exercise. This data flows to Post 5's drift detection. When an exercise's hint-to-solution ratio spikes — learners increasingly need GUIDED or SOLUTION-level help where NUDGE used to suffice — it signals content drift. The exercise re-enters Post 3's pipeline through Post 4's orchestration, carrying the error pattern data that informs the revision.
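
One simple way to flag that spike (the baseline window and threshold factor are placeholders): compare the share of GUIDED- and SOLUTION-level hints in the current cohort against a trailing baseline.

```python
def high_touch_share(hint_levels: list[str]) -> float:
    """Fraction of hints at GUIDED or SOLUTION level for one cohort's attempts at an exercise."""
    if not hint_levels:
        return 0.0
    return sum(h in {"GUIDED", "SOLUTION"} for h in hint_levels) / len(hint_levels)

def drifted(current: list[str], baseline_cohorts: list[list[str]], factor: float = 1.5) -> bool:
    """Flag the exercise for re-generation when high-touch hints spike vs. the baseline."""
    baseline = sum(high_touch_share(c) for c in baseline_cohorts) / len(baseline_cohorts)
    return high_touch_share(current) > factor * max(baseline, 0.05)

baseline = [["NUDGE", "NUDGE", "HINT", "GUIDED"], ["NUDGE", "HINT", "NUDGE", "NUDGE"]]
current  = ["GUIDED", "SOLUTION", "GUIDED", "HINT", "NUDGE", "SOLUTION"]
print(drifted(current, baseline))  # True -- learners increasingly need heavy help
```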

Loop 3: Drift Detection → Orchestration → Re-evaluation

Post 5 detects drift. Post 4's orchestration creates update events with metadata about the triggering signal. The pipeline (Post 3) re-generates affected content. Post 2's evaluation gates the output. The loop closes. Content versioning (Post 4's event sourcing) means every revision is traceable back to the drift signal that triggered it.
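
A minimal event-sourcing sketch of that traceability, with illustrative event fields: each revision is an append-only event carrying the drift signal that triggered it, so any lesson's current version can be replayed together with its provenance.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContentEvent:
    lesson_id: str
    version: int
    trigger: str        # drift signal that caused this revision, or "initial"
    content_ref: str    # pointer to the generated artifact

event_log: list[ContentEvent] = [
    ContentEvent("lesson-04", 1, "initial", "content/lesson-04/v1"),
    ContentEvent("lesson-04", 2, "dependency_outdated", "content/lesson-04/v2"),
    ContentEvent("lesson-04", 3, "code_sample_fails", "content/lesson-04/v3"),
]

def current_version(lesson_id: str) -> ContentEvent:
    """Replay the log: the latest event for a lesson is its current state."""
    return max((e for e in event_log if e.lesson_id == lesson_id), key=lambda e: e.version)

def provenance(lesson_id: str) -> list[str]:
    """Every drift signal that ever triggered a revision, in version order."""
    return [e.trigger for e in sorted(
        (e for e in event_log if e.lesson_id == lesson_id), key=lambda e: e.version)]

print(current_version("lesson-04").version)  # 3
print(provenance("lesson-04"))               # ['initial', 'dependency_outdated', 'code_sample_fails']
```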

How I'd Actually Build This

The series reads sequentially, but nobody should build it that way. Build iteratively — identify small, independent pieces that demonstrate value on their own before wiring them together. Start with a single course, a minimal rubric, and one generation agent. Ship something. Let evidence from what's working (and what isn't) determine when each additional layer earns its place. Implementation sequencing is outside the scope of this series; the posts are structured to explain the architecture, not to prescribe a build order.

Conclusion

This overview is the map. The seven posts are the territory.

The architecture described across this series forms a closed loop where quality signals from learner interaction flow backward through every layer, continuously improving content and the evaluation criteria that define quality. That closed loop is the central architectural insight: content generation is a means, not an end, and every layer exists to serve measurable learning outcomes.

The most important design decision in this system has nothing to do with which model to use or how to orchestrate agents. It's deciding that the goal is learning outcomes, not content generation. Everything else follows from there.
