Evaluation-First Agent Architecture for Learning Outcomes

A practical architecture for LLM evaluation systems with calibrated judges, threshold-based routing, and dimension-specific HITL.

In curriculum design, you define learning objectives before writing instruction. The same principle applies to content agents: define success criteria first, design evaluators that measure those criteria, then build agents that target them.

This post is part of the System Design Notes: Agentic Content Platforms for Technical Education series. I argue that you shouldn't build a content agent without first building the evaluation system it targets—you can, but you risk optimizing the wrong behaviors.

The Evaluation-First Principle

In the previous post, I argued that optimizing for content generation volume is Goodhart's Law applied to education AI—metrics like "lessons generated per hour" go up while learning outcomes stay flat. The alternative: evaluation-first design, borrowed from curriculum theory (Understanding by Design). This post is the architectural deep-dive on how to build that evaluation system.

The typical approach looks like this:

  1. Build agent

  2. Evaluate outputs

  3. Iterate on agent

The inverted approach:

  1. Define success criteria

  2. Build evaluators that measure those criteria

  3. Build agents that target the evaluators

This mirrors curriculum design: objectives → rubric → instruction. The rubric exists before the lesson does. The same should be true for content agents.

At an edtech company, I designed AI-assisted workflow assessments and activities for the AI Fundamentals program. The hardest part wasn't generating content—it was defining "good enough." What makes an AI activity pedagogically sound? I had to work with instructional designers to create rubrics before I could evaluate anything. That experience directly informs this architecture.

This post covers:

  • Five rubric dimensions with thresholds and check types

  • A three-layer evaluation stack (deterministic → LLM → human)

  • Judge calibration methodology using Cohen's kappa

  • Threshold-based HITL routing

  • Offline vs online evaluation and feedback loops

  • Cost modeling for eval pipelines

Throughout this post, we'll follow a single lesson—"Building a REST API with Flask: Handling Errors"—through the entire evaluation pipeline. This Python lesson teaches try/except blocks and HTTP status codes. It will reveal how a lesson can pass four of five dimensions and still fail learners.

The Five Rubric Dimensions

Why five dimensions instead of twelve? More dimensions feels like better coverage, but each dimension requires calibration work. Dimensions you can't calibrate reliably—where the LLM judge barely agrees with human raters—add noise, not signal. Fewer dimensions, measured rigorously, beat more dimensions measured poorly.

Limiting the number of criteria to 5-7 maintains usability. Each dimension in this architecture can be calibrated to κ > 0.6—or defaults to human review if calibration fails. If you can't get humans to agree on a dimension, don't ask a model to judge it.

| Dimension | Threshold | Check Type | Rationale |
|---|---|---|---|
| Technical Correctness | 0.95 | Deterministic + LLM | Factual errors destroy trust; highest bar |
| Conceptual Clarity | 0.85 | LLM-as-judge | Subjective but calibratable; allows stylistic variation |
| Cognitive Load | 0.80 | LLM-as-judge | More tolerance for edge cases; learner-dependent |
| Prerequisite Alignment | 1.0 | Deterministic | Binary: references known concepts or doesn't |
| Code Executability | 1.0 | Deterministic | Binary: code runs or it doesn't |

Some thresholds are 1.0 (no tolerance) because they're binary checks. Code either runs or it doesn't. A lesson either references concepts covered in prerequisites or it doesn't. There's no "partially executable" code.

Graduated thresholds (0.80-0.95) allow for subjective variation. Two instructional designers might reasonably disagree on whether a lesson's cognitive load is optimal. The threshold accommodates that variance while still catching clearly problematic content.

A caveat on Cognitive Load: reliably quantifying cognitive load via automated methods remains an open research problem. This dimension will likely show lower κ scores than others and may default to human review more often. Include it if your calibration data supports it; drop it if κ stays below 0.6.

Choosing Your Dimensions

These five dimensions work for Python programming lessons. Your domain may need different ones. A cybersecurity lab course might prioritize "Environment Reproducibility"—can learners actually run the lab on their machines? An assessment-heavy course might add "Fair Difficulty Progression"—are questions calibrated to the content just taught? A video course might require "Accessibility Compliance"—captions, transcripts, alt text.

Three criteria help determine whether a dimension belongs in your rubric:

Can humans agree on it? If two instructional designers disagree 40% of the time on whether a lesson passes, your LLM judge can't do better. Some dimensions sound important but resist reliable measurement. "Learner motivation" matters, but can you define pass/fail criteria that experts consistently apply? If κ between human raters is below 0.6, the dimension isn't ready for your rubric—not because it doesn't matter, but because you can't measure it reliably.

Is it observable pre-publication? "Learner engagement" is critical for content quality, but you can't measure it until learners see the content. Dimensions must be evaluable from the content itself, not from downstream outcomes. This doesn't mean ignoring engagement—it means measuring proxy indicators (pacing, interactivity, worked examples) that correlate with engagement.

Does failure tell reviewers what to fix? A dimension that says "this content is bad" without indicating why forces reviewers to re-diagnose the problem. Each dimension should map to a specific type of fix. Technical Correctness failures go to SMEs. Cognitive Load failures go to instructional designers. Prerequisite Alignment failures go to curriculum designers. If a dimension failure could mean five different things, split it into more specific dimensions.

For the Flask lesson, Prerequisite Alignment is non-negotiable—technical education fails when learners lack required knowledge. But a creative writing course might not need this dimension at all, while adding "Stylistic Consistency" or "Voice Appropriateness" that would be meaningless for Python tutorials.

Running Example: Flask Lesson Scores

Let's see how our Flask error handling lesson scores:

| Dimension | Score | Status | Notes |
|---|---|---|---|
| Technical Correctness | 0.96 | ✓ Pass | Code runs correctly, Flask syntax accurate |
| Conceptual Clarity | 0.87 | ✓ Pass | Clear explanation of try/except error handling |
| Cognitive Load | 0.82 | ✓ Pass | Appropriate pacing, concepts build on each other |
| Prerequisite Alignment | 0.0 | ✗ FAIL | Assumes learner knows HTTP status codes (404, 500) without prior coverage |
| Code Executability | 1.0 | ✓ Pass | All code blocks execute successfully |

This lesson passes 4/5 dimensions—it looks "good" by most metrics. But learners will hit a wall when the lesson references return jsonify(error), 404 without explaining what 404 means. The code is correct. The explanation is clear. The pacing is appropriate. And learners will still get stuck.

This is the failure mode that volume-focused generation tends to miss. A system optimizing for "lessons generated per hour" would publish this lesson and move on.

from dataclasses import dataclass
from enum import Enum
from typing import List

class CheckType(Enum):
    DETERMINISTIC = "deterministic"
    LLM_JUDGE = "llm_judge"
    HYBRID = "hybrid"  # deterministic + LLM

@dataclass
class RubricDimension:
    name: str
    threshold: float
    check_type: CheckType
    description: str

@dataclass
class ContentRubric:
    dimensions: List[RubricDimension]

    @classmethod
    def default(cls) -> "ContentRubric":
        return cls(dimensions=[
            RubricDimension(
                name="technical_correctness",
                threshold=0.95,
                check_type=CheckType.HYBRID,
                description="Factual accuracy and code 
correctness"
            ),
            RubricDimension(
                name="conceptual_clarity",
                threshold=0.85,
                check_type=CheckType.LLM_JUDGE,
                description="Clear explanation of concepts"
            ),
            RubricDimension(
                name="cognitive_load",
                threshold=0.80,
                check_type=CheckType.LLM_JUDGE,
                description="Appropriate pacing and 
complexity"
            ),
            RubricDimension(
                name="prerequisite_alignment",
                threshold=1.0,
                check_type=CheckType.DETERMINISTIC,
                description="References only covered 
prerequisites"
            ),
            RubricDimension(
                name="code_executability",
                threshold=1.0,
                check_type=CheckType.DETERMINISTIC,
                description="All code blocks execute 
without error"
            ),
        ])
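
To make the thresholds operational, here's a minimal sketch of scoring a piece of content against this rubric; the evaluate_against_rubric helper and the hard-coded Flask scores are illustrative, not part of an existing library.

from typing import Dict

def evaluate_against_rubric(
    rubric: ContentRubric,
    dimension_scores: Dict[str, float],
) -> Dict[str, object]:
    """Compare per-dimension scores against the rubric's thresholds."""
    failed = [
        dim.name
        for dim in rubric.dimensions
        if dimension_scores.get(dim.name, 0.0) < dim.threshold
    ]
    return {"passed": not failed, "failed_dimensions": failed}

# Hypothetical scores for the Flask error-handling lesson
flask_scores = {
    "technical_correctness": 0.96,
    "conceptual_clarity": 0.87,
    "cognitive_load": 0.82,
    "prerequisite_alignment": 0.0,
    "code_executability": 1.0,
}

evaluate_against_rubric(ContentRubric.default(), flask_scores)
# -> {"passed": False, "failed_dimensions": ["prerequisite_alignment"]}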

The Evaluation Stack: Deterministic → LLM → Human

Not all checks are equal in cost or reliability. The evaluation stack orders checks from cheapest/fastest to most expensive:

| Layer | Cost | Speed | Reliability | Examples |
|---|---|---|---|---|
| 1. Deterministic | Lowest | Fastest | Highest | Code execution, prerequisite checking, format validation |
| 2. LLM-as-judge | Medium | Medium | Calibration-dependent | Conceptual clarity, cognitive load, technical correctness |
| 3. Human review | Highest | Slowest | Highest fidelity | Edge cases, calibration disagreements, novel content |

This ordering matters. Filter obvious failures cheaply before expensive evaluation. Multi-tiered evaluation can reduce costs significantly—run lightweight checks first, reserve expensive GPT-4-level evaluation for content that passes basic checks.
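
A minimal sketch of that tiered ordering, assuming per-dimension check and judge callables (the names and signatures here are placeholders, not a specific framework):

from typing import Callable, Dict, List, Tuple

def run_tiered_evaluation(
    content: str,
    deterministic_checks: Dict[str, Callable[[str], float]],
    llm_checks: Dict[str, Callable[[str], float]],
    thresholds: Dict[str, float],
    skip_llm_on_failure: bool = False,
) -> Tuple[Dict[str, float], List[str]]:
    """Run the cheap deterministic layer first, then LLM judges."""
    scores: Dict[str, float] = {}
    failed: List[str] = []

    # Layer 1: deterministic checks (code execution, prerequisite graph, format)
    for name, check in deterministic_checks.items():
        scores[name] = check(content)
        if scores[name] < thresholds[name]:
            failed.append(name)

    # Cost-optimized variant: skip LLM spend on content that already failed a
    # hard check. The trade-off is fewer scores for the human reviewer.
    if failed and skip_llm_on_failure:
        return scores, failed

    # Layer 2: LLM-as-judge for the subjective dimensions
    for name, judge in llm_checks.items():
        scores[name] = judge(content)
        if scores[name] < thresholds[name]:
            failed.append(name)

    return scores, failed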

Running Example: Flask Lesson Through the Stack

Watch how the Flask lesson flows through each layer:

1. Deterministic layer:

  • Code Executability: ✓ All code blocks execute

  • Prerequisite Alignment: ✗ Checker finds reference to "404 status code" without matching prerequisite in curriculum graph

  • Result: Failure detected at cheapest layer—no additional LLM calls needed for this dimension

2. LLM-as-judge layer:

  • Conceptual Clarity: 0.87 (passes threshold of 0.85)

  • Cognitive Load: 0.82 (passes threshold of 0.80)

  • Technical Correctness: 0.96 (passes threshold of 0.95)

3. Human review layer:

  • Routed for review because Prerequisite Alignment failed

  • The deterministic check caught what LLM-as-judge would have missed

The deterministic prerequisite checker compares concept references in the lesson against the learner's completed prerequisites in the curriculum graph. When it finds "HTTP 404" referenced but no prior lesson covering HTTP status codes, it fails immediately. No LLM call needed. No subjective judgment required.

In practice, "deterministic" here means rule-based, not trivially simple. Mapping natural language references ("HTTP 404") to curriculum graph nodes requires concept extraction—keyword matching at minimum, semantic embeddings for robust detection. The check is deterministic in that it doesn't require LLM judgment calls, but implementation still involves NLP preprocessing.

Content flows through deterministic checks first, then LLM evaluation, with human review catching failures at any layer.

Judge Calibration Methodology

Note: This calibration workflow assumes you have 30-50 labeled examples per dimension. The next section covers where that data comes from and what can go wrong with it.

The goal: LLM judge scores should correlate with expert human judgment. Raw agreement percentage is misleading—two raters who each say "pass" 90% of the time will agree about 82% of the time by chance alone (0.9 × 0.9 both pass, plus 0.1 × 0.1 both fail).

Cohen’s kappa (κ) adjusts observed agreement for chance agreement; common (but domain-dependent) interpretation bands are often attributed to Landis & Koch (1977):

| κ Range | Agreement Level | Action |
|---|---|---|
| > 0.80 | Almost perfect | Automate fully |
| 0.60–0.80 | Substantial | Automate with spot-checks |
| < 0.60 | Moderate or worse | Human review required |

These bands are guidelines, not a universal standard; acceptable κ depends on stakes and the cost of errors.

MT-Bench and Chatbot Arena show that strong LLM judges (e.g., GPT-4) can correlate well with human preferences and can approach human-level agreement on some comparative judging setups, while also exhibiting measurable biases (position, verbosity, self-enhancement).

Educational rubric evaluation is a different domain, but the core finding—that structured calibration dramatically improves LLM-human agreement—transfers. This requires structured calibration, not just a well-written prompt.

Calibration Workflow

  1. Create calibration set (30-50 examples per dimension)

  2. Have domain expert label with binary pass/fail + written
    critique

  3. Run LLM judge on same examples

  4. Compute κ per dimension

  5. Iterate on judge prompt until κ > 0.6

  6. For dimensions that won't calibrate, default to human
    review

ℹ️ Note: Model versions change. When the underlying LLM updates (GPT-4 → GPT-4 Turbo, etc.), re-run calibration to verify κ scores still hold. A model update can shift judge behavior enough to invalidate previous calibration.

Pass/fail is unambiguous. Written critiques force the rater to articulate reasoning, which improves both human and LLM consistency.

The Five Biases to Mitigate

Industry engineering write-ups (e.g., GoDaddy) and academic studies describe recurring judge biases and practical mitigations:

| Bias | Impact | Mitigation |
|---|---|---|
| Position bias | Order-dependent outcomes based on answer order | Randomize presentation order |
| Verbosity bias | Tendency to prefer longer responses | Explicit conciseness criteria in rubric |
| Self-enhancement bias | Tendency to favor its own outputs or same-family outputs | Use different model family as judge |
| Overly positive skew | Tendency toward favorable ratings | Require a brief justification tied to rubric criteria, citing evidence from the text, before assigning a score |
| Prompt sensitivity | Scores vary with minor prompt changes | Normalize against reference sets |

These biases are documented in the LLM-as-a-judge literature; magnitudes vary by task, prompt, and judge model.

Position bias is a known failure mode of LLM judges: the same pair can receive different outcomes when you swap answer order, so randomizing order (or judging both orders) is a standard mitigation.

from typing import List

def calibrate_dimension(
    dimension: str,
    calibration_examples: List[LabeledExample],
    judge_prompt: str
) -> CalibrationResult:
    """
    Run calibration for a single dimension.
    Returns kappa score and recommendations.
    Assumes LabeledExample, CalibrationResult, randomize_presentation,
    and run_llm_judge are defined elsewhere in the pipeline.
    """
    human_labels = [ex.human_label for ex in calibration_examples]

    # Run LLM judge on each example (randomize order)
    llm_labels = []
    for example in calibration_examples:
        # Randomize to mitigate position bias
        shuffled = randomize_presentation(example)
        label = run_llm_judge(shuffled, judge_prompt)
        llm_labels.append(label)

    kappa = compute_cohens_kappa(human_labels, llm_labels)

    if kappa > 0.80:
        return CalibrationResult(
            dimension=dimension,
            kappa=kappa,
            recommendation="automate_fully"
        )
    elif kappa > 0.60:
        return CalibrationResult(
            dimension=dimension,
            kappa=kappa,
            recommendation="automate_with_spot_checks"
        )
    else:
        return CalibrationResult(
            dimension=dimension,
            kappa=kappa,
            recommendation="human_review_required"
        )
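
For reference, here's a minimal implementation of the compute_cohens_kappa helper used above for binary pass/fail labels (scikit-learn's cohen_kappa_score computes the same statistic if you'd rather not hand-roll it):

from typing import List

def compute_cohens_kappa(labels_a: List[bool], labels_b: List[bool]) -> float:
    """Cohen's kappa for two raters with binary pass/fail labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's marginal pass rate
    pass_a = sum(labels_a) / n
    pass_b = sum(labels_b) / n
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)

    if expected == 1.0:
        return 1.0  # both raters are constant; kappa is undefined, treat as perfect
    return (observed - expected) / (1 - expected)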

Choosing Your Judge Model

Which model evaluates your content matters as much as how you calibrate it. The choice involves trade-offs between cost, capability, and bias risk.

For Technical Correctness, smaller models often suffice—code either works or it doesn't, and even GPT-3.5 can verify syntax and catch obvious errors. For Cognitive Load, you need a model that understands pedagogy, not just syntax. The difference between "too many concepts" and "appropriate depth" requires nuanced judgment that smaller models struggle with.

Ensemble approaches—voting across three or more smaller models—often outperform single large models at lower cost. If two of three models flag a lesson for Conceptual Clarity issues, that's more reliable than a single GPT-4 judgment. The trade-off is complexity: you're managing multiple model calls, aggregation logic, and potentially different failure modes.

One critical constraint: if your content generator is Claude, don't use Claude as the judge. Same-family judges can exhibit self-enhancement bias, so use a model from a different family for judgment.
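
A sketch of the ensemble idea, assuming you already have per-model judge functions; the judge_with_* names in the comment are hypothetical wrappers, not real APIs:

from typing import Callable, List

JudgeFn = Callable[[str, str], bool]  # (content, dimension) -> pass/fail

def ensemble_judge(content: str, dimension: str, judges: List[JudgeFn]) -> bool:
    """Majority vote across judge models from different families."""
    votes = [judge(content, dimension) for judge in judges]
    return sum(votes) > len(votes) / 2

# Hypothetical wrappers around three different model families:
# ensemble_judge(lesson_text, "conceptual_clarity",
#                [judge_with_gpt35, judge_with_haiku, judge_with_command_r])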

Running Example: Flask Lesson Calibration

Suppose the Flask lesson's Conceptual Clarity score of 0.87 came from an LLM judge calibrated on 40 expert-labeled examples. That judge achieved κ = 0.74 on the Conceptual Clarity dimension—good enough for automation with spot-checks, which is why we trust the 0.87 score.

If that judge had only achieved κ = 0.52, we'd route all Conceptual Clarity evaluations to human review regardless of the score.

Each dimension maps to an automation decision based on its calibrated kappa score. Numbers provided are hypothetical.

A common concern: LLM-as-judge sounds unreliable. But this architecture layers deterministic checks first—LLM judges only handle subjective dimensions, and only after calibration. Research suggests GPT-4-class judges can approach human-level agreement when properly calibrated. The question isn't "is LLM-as-judge reliable?" but "for which dimensions, with what calibration?"

Calibration Data Readiness

The 30-50 calibration examples per dimension are data—treat them accordingly. Before trusting your calibration data, ask four questions:

Source: Where do examples come from?

| Source | Quality | Cost | Bias Risk |
|---|---|---|---|
| Expert-authored | Gold standard | High | Overrepresents "clean" content |
| Sampled from production | Realistic distribution | Low | May include errors |
| Synthetic | Variable | Low | Clusters around generator patterns |

Each source has different bias profiles. Expert-authored examples may overrepresent "clean" content that production rarely sees. Production samples include the messy edge cases experts forget to create. Synthetic examples may cluster around patterns the generator learned rather than genuine variety.

Quality: Are labels consistent?

What's the inter-rater reliability before you use the examples to calibrate the judge? If two human experts disagree on 40% of examples, there's no consistent ground truth for your LLM judge to calibrate against.

Garbage calibration data → garbage judge.

Binary labels with written critiques produce clearer, more consistent annotations than numeric scores without explanation.

Lineage: Can you version calibration sets?

When you retune thresholds, can you trace back to which calibration data drove the change? Version your calibration sets like you version code. When a threshold adjustment causes problems, you need to understand whether the issue is the threshold or the calibration data it was based on.
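
One lightweight way to get that lineage, sketched here as an assumption rather than a prescribed schema, is to store each calibration set with a version tag and a content hash so a threshold change can cite the exact data behind it:

import hashlib
import json
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class CalibrationSet:
    dimension: str
    version: str          # e.g. "conceptual_clarity-2024-06" (illustrative)
    examples: List[dict]  # labeled examples, anonymized before storage

    def content_hash(self) -> str:
        """Stable hash so a threshold change can reference the exact data behind it."""
        payload = json.dumps(self.examples, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]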

Privacy: PII in learner content

If calibration examples include learner-generated content—code submissions, written responses, project work—they may contain PII. Names in variable strings. Emails in test data. Identifying information in comments. Anonymization before use in calibration isn't optional. It's a compliance requirement.

Threshold-Based HITL Routing

Not all failures route to the same reviewer. Technical Correctness failures need a subject matter expert. Cognitive Load failures need an instructional designer. Routing content to the wrong specialist wastes their time and delays fixes.

Dimension-Aware Routing

The routing logic handles failures first:

  1. If any dimension score falls below threshold → route to specialist for that dimension

  2. If confidence is below 85% but no dimension failed → route to general review queue

  3. If content passed all dimensions → apply the configured publish strategy

This routing logic is intentionally simplified. Production systems would add severity scoring for multi-dimensional failures, reviewer workload balancing, and escalation timeouts. Those concerns are real but beyond this architecture overview—the key insight is dimension-aware routing, not the full queue management system.

import random
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

@dataclass
class EvalResult:
    content_id: str
    dimension_scores: Dict[str, float]
    passed: bool
    failed_dimensions: List[str]
    confidence: float

class PublishStrategy(Enum):
    AUTO_PUBLISH = "auto_publish"
    SPOT_CHECK = "spot_check"
    CONFIDENCE_GATING = "confidence_gating"
    COLD_START = "cold_start"

class HITLRouter:
    ROUTING_MAP = {
        "technical_correctness": "subject_matter_expert",
        "conceptual_clarity": "instructional_designer",
        "cognitive_load": "instructional_designer",
        "prerequisite_alignment": "curriculum_designer",
        "code_executability": "developer",
    }

    def __init__(
        self,
        publish_strategy: PublishStrategy = PublishStrategy.AUTO_PUBLISH,
        spot_check_rate: float = 0.10,
        confidence_threshold: float = 0.90,
        cold_start_review_count: int = 50,
    ):
        self.publish_strategy = publish_strategy
        self.spot_check_rate = spot_check_rate
        self.confidence_threshold = confidence_threshold
        self.cold_start_review_count = cold_start_review_count
        self._reviewed_count = 0

    def route(self, result: EvalResult) -> str:
        # Failed dimensions always route to specialist
        if result.failed_dimensions:
            primary_failure = result.failed_dimensions[0]
            return self.ROUTING_MAP.get(
                primary_failure,
                "general_review"
            )

        # Low confidence routes to general review
        if result.confidence < 0.85:
            return "general_review"

        # Content passed all dimensions—apply publish strategy
        return self._apply_publish_strategy(result)

    def _apply_publish_strategy(self, result: EvalResult) -> str:
        match self.publish_strategy:
            case PublishStrategy.AUTO_PUBLISH:
                return "auto_publish"

            case PublishStrategy.SPOT_CHECK:
                # Randomly audit X% of passing content
                if random.random() < self.spot_check_rate:
                    return "spot_check_review"
                return "auto_publish"

            case PublishStrategy.CONFIDENCE_GATING:
                # Only auto-publish if the average dimension score exceeds the threshold
                avg_confidence = sum(result.dimension_scores.values()) / len(result.dimension_scores)
                if avg_confidence >= self.confidence_threshold:
                    return "auto_publish"
                return "confidence_review"

            case PublishStrategy.COLD_START:
                # Human review first N lessons, then graduate
                self._reviewed_count += 1
                if self._reviewed_count <= self.cold_start_review_count:
                    return "cold_start_review"
                return "auto_publish"

In practice, combining automated judging with targeted human review can substantially reduce manual review volume while preserving quality—especially when you route only low-confidence or failed dimensions to humans.

Handling Multi-Dimensional Failures

The routing code takes failed_dimensions[0]—but what happens when three dimensions fail? Sending a lesson through specialist queues one at a time (curriculum designer → instructional designer → instructional designer) wastes everyone's time when the content has structural problems.

| Strategy | How It Works | Best For |
|---|---|---|
| Priority ordering | Route to highest-priority failure first (Technical Correctness > Cognitive Load) | Clear hierarchy where some failures are more fundamental |
| Severity weighting | Route based on how far below threshold (0.10 is worse than 0.80) | When margin matters more than category |
| Major rewrite threshold | 3+ failures → "structural_review" queue instead of specialist | Avoiding iteration loops on fundamentally broken content |

Priority ordering assumes some dimensions are more fundamental than others. A lesson with incorrect code (Technical Correctness failure) shouldn't get instructional design polish first—fix the code, then address clarity. A typical priority order: Technical Correctness → Prerequisite Alignment → Conceptual Clarity → Cognitive Load → Code Executability.

Severity weighting looks at how badly a dimension failed, not just which dimension. A lesson scoring 0.10 on Conceptual Clarity has different problems than one scoring 0.80. The 0.10 likely needs a complete rewrite of the explanation; the 0.80 needs minor clarifications. Routing both to the same queue ignores this distinction.

The major rewrite threshold catches fundamentally broken content. A lesson failing one dimension needs a targeted fix. A lesson failing four dimensions probably has scope or structural issues that no single specialist can resolve. Route it to a senior curriculum architect who can address the root cause rather than cycling it through specialist queues.
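
A sketch combining priority ordering with a major-rewrite threshold; the priority list and the three-failure cutoff are the illustrative values from this section, not fixed constants:

from typing import Dict, List

DIMENSION_PRIORITY = [
    "technical_correctness",
    "prerequisite_alignment",
    "conceptual_clarity",
    "cognitive_load",
    "code_executability",
]

def route_multi_failure(
    failed_dimensions: List[str],
    routing_map: Dict[str, str],
    major_rewrite_threshold: int = 3,
) -> str:
    """Route multi-dimensional failures without chaining specialist queues."""
    if len(failed_dimensions) >= major_rewrite_threshold:
        # Likely a scope or structure problem; send to a senior reviewer
        return "structural_review"

    def priority(dim: str) -> int:
        # Unknown dimensions sort last
        return DIMENSION_PRIORITY.index(dim) if dim in DIMENSION_PRIORITY else len(DIMENSION_PRIORITY)

    # Otherwise route to the specialist for the most fundamental failure
    most_fundamental = min(failed_dimensions, key=priority)
    return routing_map.get(most_fundamental, "general_review")

# route_multi_failure(["cognitive_load", "prerequisite_alignment"],
#                     HITLRouter.ROUTING_MAP)  # -> "curriculum_designer"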

If the Flask lesson had failed Prerequisite Alignment AND Cognitive Load AND Conceptual Clarity, it shouldn't queue for curriculum designer → instructional designer → instructional designer. It should go directly to a senior reviewer who can assess whether the lesson's scope is viable or whether it needs to be split, restructured, or scrapped.

Iteration Limits: When to Stop Fixing

What happens when content fails, gets fixed, re-evaluated, and fails again? At some point, iteration isn't the answer—the content has structural problems that targeted fixes won't solve.

Track attempt count per content item:

  • 1-2 iterations: Normal revision cycle. Most content passes within two attempts.

  • 3 iterations on the same dimension: Escalate to senior reviewer. The dimension-specific fix isn't working.

  • 5+ total iterations: Flag for potential scope or structure issues. This content may need a rewrite, not more fixes.

Some lessons are fundamentally broken. A Python exercise trying to teach recursion, decorators, and generators in one lesson has a scope problem—no amount of "fixing" the Cognitive Load dimension will help. The lesson needs to be split into three lessons, not polished into one.

The iteration limit isn't about giving up on content; it's about recognizing when the fix strategy needs to change. Dimension-specific routing assumes the content structure is sound and only specific aspects need improvement. When that assumption fails, escalate to someone who can make structural decisions.
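
A minimal sketch of that escalation logic, using the illustrative limits above and assuming you log one failed dimension per revision attempt:

from collections import Counter
from typing import List

def escalation_decision(failed_dimension_history: List[str]) -> str:
    """Decide whether to keep iterating or escalate, given past failed dimensions."""
    total_attempts = len(failed_dimension_history)
    per_dimension = Counter(failed_dimension_history)

    if total_attempts >= 5:
        return "flag_for_restructure"         # likely a scope or structure problem
    if per_dimension and max(per_dimension.values()) >= 3:
        return "escalate_to_senior_reviewer"  # the same dimension keeps failing
    return "continue_revision_cycle"

# The Flask lesson failing Prerequisite Alignment three times:
# escalation_decision(["prerequisite_alignment"] * 3)
# -> "escalate_to_senior_reviewer"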

For the Flask lesson, imagine it failed Prerequisite Alignment three times—first for HTTP status codes, then for JSON serialization, then for Flask's application context. Each fix addressed one prerequisite gap, but new ones kept appearing. The problem isn't the reviewer's fix quality. The problem is the lesson scope: it assumes too much prior knowledge. A senior reviewer might decide the course needs an "HTTP Fundamentals" module before the Flask section, rather than patching prerequisites one at a time.

Final Review Strategies

What happens when content passes all five dimensions? The answer isn't always "auto-publish." Different contexts call for different levels of caution.

| Strategy | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Pure auto-publish | All passes → publish immediately | Fastest throughput, lowest HITL cost | No safety net for edge cases | High-confidence systems with mature calibration |
| Spot-check sampling | Auto-publish but randomly audit X% | Catches drift, minimal latency | Issues found post-publish | Established systems needing ongoing validation |
| Confidence gating | Auto-publish only if avg confidence > threshold | Catches low-confidence passes | Adds latency for borderline content | Domains where "just above threshold" is risky |
| Cold start mode | Human review first N lessons, then graduate | Builds trust before automation | Slow initial deployment | New systems, new content types, new domains |

Pure auto-publish is the default in this architecture. Content passes all dimensions, it publishes. This works when your calibration is mature and your thresholds are well-tuned. But it assumes the evaluation system catches everything that matters.

Spot-check sampling maintains a safety net without blocking throughput. Auto-publish 90% of passing content, randomly route 10% to human review. This catches calibration drift—when the LLM judge's behavior subtly shifts over time—without requiring humans to review everything. The downside: issues found in spot-checks appear after publication.

Confidence gating treats "just barely passing" differently from "clearly passing." A lesson scoring 0.86 on Conceptual Clarity (threshold: 0.85) passes technically, but you might want human eyes on borderline cases. Set a higher bar for full automation—say, average dimension score > 0.90—and route borderline passes to review.

Cold start mode is appropriate when you don't yet trust your evaluation system. Maybe you're launching a new content type. Maybe you've significantly changed your rubric. Review the first 50 lessons manually, compare your judgments to the automated scores, and graduate to auto-publish once you've validated the system works.

These strategies aren't mutually exclusive. You might run cold start mode for a new content type while using spot-check sampling for established content. The key is making the decision explicit and configurable rather than defaulting to auto-publish everywhere.

Running Example: Flask Lesson Routing

The Flask lesson's Prerequisite Alignment failure routes to a curriculum designer—not the SME, not the instructional designer. Why? The curriculum designer owns the prerequisite graph and can make structural changes to the learning path.

The curriculum designer reviews and chooses one of two fixes:

  1. Add "HTTP Status Codes" as a prerequisite lesson in the learning path

  2. Add a "What you need to know" callout box explaining 404/500 codes inline

After the fix, the lesson re-enters the evaluation pipeline. On second pass, all five dimensions pass. Now the publish strategy determines the next step:

| Strategy | Flask Lesson Path |
|---|---|
| Pure auto-publish | Publishes immediately |
| Spot-check (10%) | 90% chance: publishes immediately. 10% chance: routes to spot-check review, then publishes |
| Confidence gating (0.90) | Avg score: (0.96 + 0.87 + 0.82 + 1.0 + 1.0) / 5 = 0.93 → exceeds threshold → publishes |
| Cold start (50 lessons) | If this is lesson #23 → routes to cold start review. If lesson #51 → publishes |

With pure auto-publish, the Flask lesson goes live immediately after passing. With confidence gating at 0.90, it still auto-publishes because its average score (0.93) clears the bar. But if the Cognitive Load score had been 0.80 instead of 0.82, the average would drop to 0.926, still above the 0.90 threshold. A more conservative team might set their confidence gate at 0.95, routing that lesson to review.

The strategy choice reflects organizational risk tolerance, not content quality. The lesson passed either way.

A fair concern: human review doesn't scale. Correct—and that's the point. The goal is to minimize HITL, not eliminate it. Auto-publish handles the easy cases (typically 80%+ of content). Human reviewers focus on edge cases and calibration disagreements, which is where their expertise matters most.

Offline vs Online Evaluation

Evaluation doesn't end at publication. Offline evals catch problems before learners see them. Online evals catch what offline evals missed.

| Aspect | Offline Evals | Online Evals |
|---|---|---|
| Timing | Pre-publish | Post-publish |
| Checks | All five rubric dimensions, deterministic + LLM | Completion rates, time-on-task, feedback, support tickets |
| Signals | Automated quality scores | Learner behavior patterns |
| Goal | Don't publish bad content | Catch what offline missed |

The feedback loop matters most. Online signals inform threshold adjustments and calibration refinement. If completion rates drop consistently at step 3 of lessons with certain characteristics, that pattern should influence offline evaluation criteria.
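
A sketch of one such online signal, assuming you log per-step completion rates; the 20-point drop threshold and the sample numbers are illustrative:

from typing import Dict, List

def find_drop_off_steps(
    step_completion_rates: Dict[int, float],
    max_drop: float = 0.20,
) -> List[int]:
    """Flag steps where completion drops sharply relative to the previous step."""
    flagged = []
    steps = sorted(step_completion_rates)
    for prev, curr in zip(steps, steps[1:]):
        drop = step_completion_rates[prev] - step_completion_rates[curr]
        if drop >= max_drop:
            flagged.append(curr)
    return flagged

# Hypothetical completion rates for the Flask lesson if it had shipped:
# find_drop_off_steps({1: 0.95, 2: 0.93, 3: 0.64, 4: 0.61})  # -> [3]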

As Hamel Husain notes, "It is impossible to completely determine evaluation criteria prior to human judging of LLM outputs." Criteria drift is real. The calibration process must be iterative.

Running Example: What If Offline Eval Missed It?

Imagine the Prerequisite Alignment threshold had been 0.9 instead of 1.0. The Flask lesson would have auto-published.

Online signals would have caught it:

  • Completion rate drops at step 3 — where learners first encounter return jsonify(error), 404

  • Time-on-task spikes — learners Googling "what is 404 status code"

  • Support tickets — "I don't understand why we're returning 404"

This is why binary dimensions (Prerequisite Alignment, Code Executability) have 1.0 thresholds. There's no "partially correct" for these checks. Either the prerequisite is covered or it isn't. Either the code runs or it doesn't.

Offline evaluation gates publishing; online signals create a feedback loop for continuous improvement.

Eval Pipeline Cost Modeling

Running LLM-as-judge at scale has real costs. Here's a back-of-envelope calculation:

Per-lesson costs:

  • 5 rubric dimensions × 1 LLM-as-judge call each = 5 calls per lesson

Per-module costs (20 lessons):

  • 20 lessons × 5 calls = 100 judge calls per pass

  • Average 2.5 iterations to pass = 250 calls per module

  • At ~$0.03/call = ~$7.50/module in eval costs alone

That's before content generation. A 10-module course costs ~$75 just for evaluation.

These figures are illustrative, not precise. Actual costs vary significantly by model choice, prompt length, and batch strategies. The structural point—tiered evaluation reduces costs by filtering before expensive checks—holds regardless of specific pricing.

Break-Even Analysis

At what point does HITL become cheaper than automated eval?

For low-volume content (a few modules per month), human review may be more cost-effective. A human reviewer at $50/hour evaluating 10 lessons/hour costs $5/lesson—comparable to automated eval after factoring in calibration overhead.

At scale (100+ modules), LLM-as-judge wins decisively. The fixed costs of calibration amortize across volume.
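
The arithmetic above as a small, adjustable model; every price and rate is the illustrative figure from this section:

def eval_cost_per_module(
    lessons: int = 20,
    dimensions: int = 5,
    avg_iterations: float = 2.5,
    cost_per_call: float = 0.03,
) -> float:
    """LLM-as-judge cost for one module (deterministic checks assumed ~free)."""
    return lessons * dimensions * avg_iterations * cost_per_call

def human_review_cost_per_module(
    lessons: int = 20,
    lessons_per_hour: float = 10,
    hourly_rate: float = 50.0,
) -> float:
    """Reviewer cost for fully manual evaluation of one module."""
    return (lessons / lessons_per_hour) * hourly_rate

# eval_cost_per_module()          -> 20 * 5 * 2.5 * $0.03 = $7.50
# human_review_cost_per_module()  -> (20 / 10) hours * $50 = $100.00
# Calibration's fixed cost isn't included; at low volume it can tip the
# balance back toward human review, as noted above.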

Cost Reduction Strategies

  1. Multi-tiered evaluation: Run cheap deterministic checks first. Only invoke LLM-as-judge for content that passes basic checks.

  2. Batch inference: Process lessons in batches rather than one-by-one for better throughput and lower per-token costs.

  3. Prompt caching: For workloads with large repeated prefixes, vendor caching can reduce input-token costs and latency (see OpenAI Prompt Caching docs; Claude prompt caching docs)

  4. Ensemble of smaller models: Panels of smaller models (command-r, gpt-3.5-turbo, haiku) via voting can outperform single GPT-4 evaluators at one-seventh the cost.

Trade-offs and When to Adjust

Thresholds aren't fixed. They're parameters you tune based on observed outcomes.

Tight Thresholds (0.95+)

  • Higher quality output

  • Slower throughput

  • More HITL load

  • Best for: High-stakes content, external-facing material, regulated contexts

Loose Thresholds (0.75-0.85)

  • Faster publishing

  • Lower HITL load

  • Higher risk of learner confusion

  • Best for: Internal content, rapid prototyping, low-stakes contexts

When to Adjust

  • Online signals show quality issues → tighten thresholds

  • HITL queue is backed up → consider loosening non-critical dimensions

  • New content type → recalibrate before adjusting thresholds

These thresholds aren't meant to be sacred—they're starting configurations. The architecture supports threshold adjustment without code changes. Start conservative, tune based on observed outcomes.
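
One way to support that, sketched under the assumption that thresholds live in a config file rather than in code (the file name and keys are illustrative):

import json

# thresholds.json (illustrative):
# {"technical_correctness": 0.95, "conceptual_clarity": 0.85,
#  "cognitive_load": 0.80, "prerequisite_alignment": 1.0,
#  "code_executability": 1.0}

def load_thresholds(path: str = "thresholds.json") -> dict:
    """Load per-dimension thresholds so tuning doesn't require a code change."""
    with open(path) as f:
        return json.load(f)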

Running Example: Threshold Implications

The Flask lesson failed Prerequisite Alignment with a 1.0 threshold. With a looser 0.9 threshold, it would have auto-published—and learners would have been the ones to catch the gap.

This is the trade-off made explicit: tighter thresholds cost more HITL time, but looser thresholds cost learner trust.

The Foundation for Everything Else

The Flask lesson we traced through this post is now ready to publish. It was well-written (it was), but more importantly, the evaluation system caught what "well-written" misses: a prerequisite gap that would have confused learners.

We built the evaluation system first. The lesson benefited.

Key principle: First, define your success criteria. Design evaluators that measure those criteria. Then—and only then—build agents that target them. Although you can build a content agent without first building the evaluation system it targets, you risk optimizing the wrong behaviors.

Next in the series: How content generation agents are designed to target specific rubric dimensions, using the evaluation architecture we've built here as their optimization target.
