Teaching as System Observability
Automated evals catch what's measurable. Teaching reveals what's hard to explain. Design the quality gate that catches what both miss.

I've led numerous bootcamps and enterprise workshops. Every time I teach, I find something the curriculum missed — a step that assumes knowledge the learner hasn't encountered yet, a transition that feels abrupt when spoken aloud. Teaching surfaces gaps that reading doesn't catch. For agentic content systems, the instructor provides observability through a mode of review — time-ordered, audience-reactive delivery — that typical review workflows underrepresent.
This post is part of the System Design Notes: Agentic Content Platforms for Technical Education series. Evaluation-First Agent Architecture for Learning Outcomes built the evaluation rubric — five dimensions, calibrated judges, threshold-based HITL routing. Curriculum Drift Detection: Keeping Technical Content Correct built drift detection — five signals for catching content decay after publication. This post adds the layer between those systems and the learner: the instructor, whose act of teaching generates quality signals that neither automated evals nor reading reviews produce.
This post covers:
Four quality gate layers, each catching different failure modes
Teaching as system debugging — borrowing from observability-driven development
The complete feedback loop from learner signals back to content improvement
Learner signal data readiness — how reliable are those signals as data?
What the instructor layer provides that feedback agents can't yet match
The Trap
The tempting first move: review content by reading it. Read the lesson, check that the code runs, verify the prerequisites are listed. It looks solid on paper.
Reading doesn't catch ordering errors — an explanation that references a concept from a lesson that comes later in the sequence. Speaking the content aloud is one of the highest-signal ways to surface that failure, because delivery forces strict sequential traversal. Prerequisite graphs and dependency metadata can catch some ordering issues earlier, but teaching is often where you notice the human version of the bug — the moment a learner has no mental foothold for what comes next. You can't skip ahead when you're presenting to a room.
If the person who designed the curriculum can't catch ordering errors by reading, an automated eval has even less chance. Some failures surface most reliably when you perform the content.
Four Quality Gate Layers
Content quality benefits from multiple layers of review. Each layer catches different failure modes, and the layers are complementary rather than redundant.

Four layers, each catching what the layer below misses. The instructor layer is the last chance to catch issues before learners encounter them.
Layer 1: Automated evals. Deterministic checks and LLM-as-judge scoring — the rubric dimensions from Evaluation-First Agent Architecture for Learning Outcomes. Code executability, prerequisite alignment, technical correctness, conceptual clarity, cognitive load. These catch what's testable: broken code, missing prerequisites, and some categories of factual errors — though LLM judges are probabilistic and calibration-dependent.
Layer 2: Human reviewer. Domain experts who evaluate what automated systems can't reliably judge. HITL routing sends failures to the right specialist — technical correctness failures to SMEs, cognitive load failures to instructional designers. Human reviewers catch subtle issues: a technically correct explanation that teaches the wrong mental model, or an exercise that's too similar to the worked example.
Layer 3: Instructor. The person who delivers the content. Teaching reveals what both automated evals and reading reviews miss — ordering problems, assumed knowledge, pacing issues that only surface during delivery. This is the layer this post focuses on.
Layer 4: Learner. The ultimate evaluator. When learners signal confusion, drop off, or fail assessments, that's real data. But by this point, the problem has already reached production. Every layer above exists to catch issues before learners encounter them.
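To make the layering concrete, here is a minimal sketch of how a pipeline might represent these gates. The enum names, GateResult fields, and early-exit ordering are illustrative assumptions, not a prescribed schema.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class GateLayer(Enum):
    """Quality gate layers, ordered from cheapest to most expensive to run."""
    AUTOMATED_EVAL = 1   # Deterministic checks + LLM-as-judge rubric scoring
    HUMAN_REVIEWER = 2   # SMEs and instructional designers via HITL routing
    INSTRUCTOR = 3       # Delivery-time review: ordering, pacing, assumed knowledge
    LEARNER = 4          # Production signals: confusion, drop-off, assessment failures


@dataclass
class GateResult:
    layer: GateLayer
    passed: bool
    issues: list[str]


def run_gates(
    content_id: str,
    gates: dict[GateLayer, Callable[[str], tuple[bool, list[str]]]],
) -> list[GateResult]:
    """Run layers in order; stop escalating once a layer fails,
    so cheaper fixes happen before expensive review time is spent."""
    results = []
    for layer in sorted(gates, key=lambda l: l.value):
        passed, issues = gates[layer](content_id)
        results.append(GateResult(layer, passed, issues))
        if not passed:
            break
    return results
```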
Why Reading Falls Short
Reading evaluates content as a document. Teaching evaluates content as a performance — a sequence of steps delivered in real time, where each step depends on everything that came before it.
Three categories of issues that reading tends to miss:
Ordering errors. An explanation references a concept that the learner won't encounter until a later lesson. When reading, your eyes can jump ahead — you already know what "HTTP status codes" means. When teaching, you hit the reference in sequence and realize the learner has no context for it yet.
Assumed knowledge. A step that looks reasonable on paper might rely on knowledge the target learner doesn't have. For example, a Flask lesson might fail prerequisite alignment because it assumes learners already know HTTP status codes. Reading the lesson, this assumption is invisible — the explanation is clear if you already know what 404 means.
Pacing issues. A section that reads smoothly in two minutes might cover three distinct concepts that each need time to land. The density is invisible when reading but obvious when speaking — you feel yourself rushing, and the rush is a signal.
These connect to two systems already built in this series: the evaluation rubric from Evaluation-First Agent Architecture for Learning Outcomes catches prerequisite alignment and cognitive load when properly calibrated, and the drift detection from Curriculum Drift Detection catches downstream effects (learner confusion, pass rate drops). The instructor layer catches these issues upstream, before they reach either system.
Teaching as Debugging
Automated evals answer the question "How will I know this lesson works?" for measurable properties — code runs, prerequisites align, factual claims check out. Teaching answers it for properties that resist measurement.
What teaching surfaces that reading can't:
Hard-to-explain steps. You stumble while speaking them. The stumble is a signal — if you, the subject matter expert, struggle to explain a step clearly, the written version is hiding a clarity problem.
Abrupt transitions. The jump between sections feels wrong when you perform it. There's a cognitive gap the reader might not notice but the speaker does — a missing bridge between "here's the concept" and "now apply it."
Missing context. You instinctively add context that isn't in the written material. "Oh, you'll also need to know that..." — that addition reveals a gap in the content. If you're adding it live, it should have been written.
Your own discomfort. A subtler signal. You feel uneasy about a section without being able to articulate why. In my experience, that unease is often the first indicator of missing scaffolding — a gap you can't name yet but your delivery instinct has already flagged.
Recording the teaching session turns delivery into a replayable debugging artifact. You can review where you stumbled, where you added unscripted context, where the pacing felt off. The recording captures the instructor's real-time experience — a signal source that automated systems don't currently generate. Recordings are most useful when timestamped and aligned to lesson steps, so you can jump to the moment an instructor stumbled rather than scrubbing through hours of video.


During teaching: "Wait — did I explain dictionary bracket notation? The learner hasn't seen x['key'] syntax yet."

After: Step 2.5 added, explaining dictionary access before it's used.
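Aligning a recording to lesson steps can be as simple as an index of per-step timestamps. A minimal sketch of such an index (the class and field names are hypothetical):
```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepTimestamp:
    step_id: str          # e.g. "lesson-3/step-2" (identifier format is assumed)
    start_seconds: float  # Offset into the recording where this step begins
    end_seconds: float


@dataclass
class TeachingRecording:
    """Index of a recorded delivery, aligned to lesson steps."""
    recording_uri: str
    lesson_id: str
    steps: list[StepTimestamp] = field(default_factory=list)

    def locate(self, step_id: str) -> Optional[StepTimestamp]:
        """Jump to the segment where a flagged step was delivered."""
        return next((s for s in self.steps if s.step_id == step_id), None)
```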
The Feedback Loop Architecture
The instructor layer generates value in two directions: upstream (catching issues before learners see them) and downstream (feeding signals back into the content improvement pipeline).
The complete loop:
Agents generate content → through the agentic content pipeline
Rubric evaluates → through five dimensions
Instructor teaches → surfaces issues during delivery
Learners experience → generate confusion, question, and performance signals
Drift detection catches → through five signals
Rubric refines → thresholds and calibration update based on what was missed
Agents regenerate → improved content enters the pipeline

The loop closes: learner signals feed drift detection, which triggers rubric refinement, which constrains the next generation cycle. The instructor layer accelerates this loop by catching issues before learners encounter them.
The instructor's annotations — "this step was hard to explain," "learners will need more context here" — enter the system as update events in the event queue, routing through the agent pipeline for content revision. Both instructor annotations and drift-detected issues flow through the same event pipeline — but the event schema preserves provenance. ContentEvent carries an actor field ("human:instructor_jane" or "agent:drift_detector"), and production systems need this metadata for trust weighting, routing, and audit trails. An instructor note and an automated drift signal share the same schema, but they may require different review policies before triggering content regeneration.
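A minimal sketch of what that shared schema could look like. Everything beyond the actor field is an illustrative assumption; the series does not prescribe the full field list.
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ContentEvent:
    """Update event flowing through the content pipeline.

    Instructor annotations and drift-detected issues share this schema;
    the actor field preserves provenance for trust weighting, routing,
    and audit trails.
    """
    event_id: str
    lesson_id: str
    event_type: str  # e.g. "revision_requested" (value set is assumed)
    actor: str       # "human:instructor_jane" or "agent:drift_detector"
    payload: dict = field(default_factory=dict)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def is_human_sourced(self) -> bool:
        """Human-sourced events may follow a different review policy
        before triggering regeneration than automated signals."""
        return self.actor.startswith("human:")
```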
Closed-loop learning analytics frameworks describe this pattern: use analytics to inform improvements, assess the impact of those improvements, and iterate. What the instructor layer adds is speed — catching issues in a single teaching session rather than waiting for learner signal aggregation over weeks.
Learner Feedback as System Input
In Curriculum Drift Detection, we covered what learner signals detect and when to act on them — confusion as a drift signal, pass rate drops as a quality signal. This section covers how reliable those signals are as data, because acting on unreliable signals is worse than not acting.
Three signal types map to different content issues:
| Signal Type | Content Issue | Example |
|---|---|---|
| Confusion clusters | Potential drift | Multiple learners stuck at the same step |
| Common questions | Content gaps | "What does 404 mean?" repeated across cohorts |
| Assessment failures | Content-eval misalignment | Pass rate drops without content changes |
These signals are not diagnoses. The same completion drop can stem from platform regressions, cohort differences, or assessment design changes — not just content problems. Learner telemetry prioritizes investigation, not automatic attribution.
Learner Feedback Data Readiness
Before using learner signals to drive content changes, you can apply a data readiness framework — source, quality, lineage, privacy — to these signals:
Source. Where does feedback come from? Surveys are explicit but low-volume with selection bias. Implicit signals (time-on-task, replay rate) are high-volume but noisy. Support tickets are high-signal but very low-volume. Each source has different reliability and latency characteristics.
Quality. Research shows that self-reported and behavioral measures of the same construct often correlate weakly. Self-report taps "typical performance" while behavioral measures tap "maximal performance." Process data — time-on-task patterns, score report checking — predicts outcomes better than self-reported engagement. Triangulation across signal types matters more than volume within a single type.
Lineage. Can you trace a content change back to the learner signal that triggered it? When a lesson gets updated, can you show "this change was driven by confusion signals from N learners at step 3"? Without lineage, you can't evaluate whether signal-driven changes actually improved outcomes.
Privacy. Learner interaction data has PII implications — time-on-task patterns, struggle points, and feedback text can be personally identifying. Aggregation thresholds and anonymization matter before feeding signals into content improvement pipelines. Depending on your context, FERPA (education) and GDPR (EU learners) set legal floors for consent, retention, and data minimization.
```python
from dataclasses import dataclass, field
from enum import Enum


class SignalSource(Enum):
    SURVEY = "survey"           # Explicit, low volume, selection bias
    IMPLICIT = "implicit"       # Time-on-task, replay rate — high volume, noisy
    SUPPORT = "support_ticket"  # High signal, very low volume


@dataclass
class LearnerSignals:
    """Aggregated signals from learner interactions with content."""
    lesson_id: str
    sample_size: int
    # Confusion signals → may indicate content drift
    confusion_rate: float
    avg_time_on_task: float  # Ratio vs expected (1.0 = on target)
    # Question signals → may indicate content gaps
    common_questions: list[str]
    # Failure signals → may indicate content-eval misalignment
    assessment_pass_rate: float
    completion_rate: float
    # Source metadata for triangulation
    signal_sources: list[SignalSource] = field(default_factory=list)


class LearnerSignalAggregator:
    """Determines when learner signals warrant content review."""

    def should_trigger_review(self, signals: LearnerSignals) -> bool:
        """
        Conservative thresholds — tune based on cohort size
        and baseline metrics. Smaller cohorts need wider bands
        to avoid false positives from normal variance.
        """
        # OR logic: any single breach triggers review. For noisier
        # signal sources, consider AND logic or minimum sample_size
        # checks to reduce false positives.
        if signals.sample_size < 20:
            return False  # Too few learners for reliable signal
        return (
            signals.confusion_rate > 0.3            # >30% showing confusion
            or signals.completion_rate < 0.7        # <70% completing lesson
            or signals.assessment_pass_rate < 0.6   # <60% first-attempt pass
        )

    def identify_problem_areas(self, signals: LearnerSignals) -> list[str]:
        """Maps signal patterns to investigation areas."""
        problems = []
        if signals.confusion_rate > 0.3:
            problems.append(
                "High confusion — check prerequisites "
                "and step complexity"
            )
        if signals.avg_time_on_task > 1.5:
            problems.append(
                "Taking too long — content may be too dense "
                "or missing scaffolding"
            )
        if signals.assessment_pass_rate < 0.6:
            problems.append(
                "Low pass rate — check alignment between "
                "content and assessment"
            )
        return problems
```
These thresholds (0.3 confusion rate, 0.7 completion rate, 0.6 pass rate) are starting points. A "30% confusion rate" in an advanced algorithms course might be healthy productive struggle; the same rate in an introductory Python course signals a content problem. Calibrate against your baseline for each content type and difficulty level. For small cohorts, consider cohort-normalized baselines rather than absolute thresholds — a 30% confusion rate in a 15-person cohort may be two standard deviations from noise.
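To make the small-cohort caveat concrete, here is one way to express a cohort-normalized check using the standard error of a proportion. The 12% baseline below is an assumed historical value you would replace per content type:
```python
import math


def confusion_z_score(
    observed_rate: float, baseline_rate: float, cohort_size: int
) -> float:
    """How many standard errors the observed confusion rate sits above baseline.

    Uses the normal approximation to the binomial; treat very small
    cohorts with extra caution.
    """
    std_error = math.sqrt(baseline_rate * (1 - baseline_rate) / cohort_size)
    return (observed_rate - baseline_rate) / std_error


# With an assumed 12% historical baseline, a 30% confusion rate in a
# 15-person cohort sits roughly two standard errors above noise:
# suggestive, but not conclusive on its own.
print(round(confusion_z_score(0.30, 0.12, 15), 2))  # ~2.15
```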
What Agents Can't Do Yet
Agents can respond to coded signals — error types, attempt counts, time-on-task thresholds. The next post in this series covers how feedback agents use these signals to provide graduated hints during interactive exercises. That's real adaptation within a constrained framework.
The instructor layer catches what structured signals miss:
Recognizing breakthroughs. A shift in posture, a change in the quality of questions, a sudden confidence in the learner's voice. Reinforcing that moment matters for retention, and standard sensors and interaction logs don't capture it reliably. There's no "learner had an aha moment" event to subscribe to.
Reading the room. Detecting confusion before it becomes an explicit signal. Body language, hesitation, the quality of a question — a confused "what?" versus a curious "what if?" Research on confusion detection shows that behavioral signals can detect learner confusion, but the instructor perceives it in real time with context that interaction logs can't match. Sensor-based detection (webcam pose estimation, speech prosody) can scale further but introduces privacy constraints and ethical review requirements that instructor observation avoids.
Pivoting the teaching approach. Feedback agents escalate hint levels — nudge, then hint, then guided solution. An instructor restructures the explanation entirely. "That analogy didn't land. Let me try a completely different approach." The distinction: feedback agents adapt within constrained frameworks (hint levels, error patterns). Instructors adapt the framework itself.
These capabilities are quality signals in their own right. The instructor's real-time observations during delivery — where they stumbled, where they added unscripted context, where the room's energy shifted — generate feedback that feeds directly into the content improvement loop. Micro-feedback patterns support this: lightweight, in-situ notes captured during delivery that become system input for the next revision cycle.
A caveat: instructor observations are biased. Teaching style, cohort composition, and fatigue all influence what an instructor notices and reports. Structured annotation — timestamped notes tied to specific lesson steps, tagged with a controlled vocabulary (ordering gap, missing prerequisite, pacing overload) — helps convert subjective impressions into machine-actionable signals. Multiple instructors teaching the same content provides triangulation.
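A minimal sketch of such a structured annotation, using the controlled vocabulary above. Field names and tag values are illustrative, not a fixed schema:
```python
from dataclasses import dataclass
from enum import Enum


class AnnotationTag(Enum):
    """Controlled vocabulary for instructor observations."""
    ORDERING_GAP = "ordering_gap"
    MISSING_PREREQUISITE = "missing_prerequisite"
    PACING_OVERLOAD = "pacing_overload"


@dataclass
class InstructorAnnotation:
    """A timestamped note tied to a specific lesson step."""
    instructor_id: str         # Keeps annotations attributable for triangulation
    lesson_id: str
    step_id: str
    recording_offset_s: float  # Where in the delivery the observation occurred
    tag: AnnotationTag
    note: str                  # Free-text impression behind the tag
```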
Trade-offs: When to Use This Pattern
Instructor time is expensive. Teaching every piece of content as a quality gate doesn't scale — and it doesn't need to.
When to include the instructor layer:
High-stakes learning outcomes where content errors have real consequences
Complex multi-step procedures where ordering and pacing matter
Content that will be reused at scale — the investment amortizes across thousands of learners
New content types where automated evals haven't been calibrated yet
When to skip it:
Low-stakes content (quick tips, reference material)
Rapid iteration cycles where content changes weekly
Content covered by well-calibrated automated evals with κ > 0.8
The instructor layer is easy to skip because it's expensive and the content already looks good after automated eval and human review. The question is whether "looks good" is sufficient for your learners.
A middle path: record the first delivery as a teaching review, capture annotations, then skip the instructor layer for minor revisions until signals suggest re-review. This concentrates instructor time on initial quality gating and treats subsequent deliveries as monitoring rather than active review.
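These criteria can be encoded as a lightweight routing policy rather than a per-release judgment call. A sketch, with every threshold an assumption to tune:
```python
def requires_instructor_review(
    is_high_stakes: bool,
    is_new_content_type: bool,
    expected_learner_count: int,
    judge_kappa: float,
    has_prior_taught_delivery: bool,
    signals_suggest_rereview: bool,
) -> bool:
    """Decide whether content should pass through the instructor layer.

    Mirrors the include/skip criteria above; all thresholds are illustrative.
    """
    if signals_suggest_rereview:
        return True   # Learner or drift signals override earlier approvals
    if has_prior_taught_delivery:
        return False  # Middle path: treat later deliveries as monitoring
    if is_new_content_type or judge_kappa < 0.8:
        return True   # Automated evals not yet trusted for this content
    # Otherwise, spend instructor time where the investment amortizes
    return is_high_stakes or expected_learner_count >= 1000
```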
Closing the Loop
The instructor is the last observability layer before content reaches learners. Teaching surfaces ordering problems, assumed knowledge, pacing issues, and clarity gaps that automated evals miss and reading reviews skim over.
For agentic content systems, this layer is worth designing for. The instructor's annotations feed back into the content pipeline as update events — the same infrastructure that handles drift detection and automated remediation. The quality gate layers (automated evals → human reviewer → instructor → learner) each catch different failure modes, and the system works best when all four are active.
Where does the instructor fit in your quality gate architecture?
Next in the series: The final post covers adaptive feedback agents — learner-facing agents that respond in real-time to code submissions with graduated hints and guardrails. Where this post addresses the instructor's role in content delivery, the next addresses the agent's role in interactive exercises.
Further Reading
PMC — Why Are Self-Report and Behavioral Measures Weakly Correlated? — Critical for understanding learner signal reliability
ScienceDirect — The Closed-Loop Learning Analytics Framework — Feedback loops from learner signals to content improvement
InfoQ — The Importance of Pipeline Quality Gates — Multi-layer quality gate patterns from software engineering
OEB Insights — Micro-Feedback for Instructor Self-Improvement — Capturing instructor observations during delivery
ScienceDirect — Automatic Detection of Learners' Cognitive States — Behavioral confusion detection research