RAG vs Fine-tuning: How I'd Decide
The most common architectural decision in AI features. An interactive decision tree to navigate the RAG vs fine-tuning choice.

This post is part of the Mental Models for Production AI series, which explores the mental frameworks needed to evaluate, build, operate, and improve AI-powered features—focusing on practical decision-making.
Every "RAG vs fine-tuning" article covers the same ground: RAG for dynamic knowledge, fine-tuning for behavior. That framing is useful as a starting point, but most real use cases are messier than that. Maybe your knowledge is semi-static. Maybe you need the model to deeply understand domain terminology and reference documents. Maybe you've tried both framings and neither quite fits.
In the previous post, we looked at data readiness — understanding what your data can and can't support. This post takes the next step: given your data, how do you decide what to do with it?
Here's the reasoning path I walk through when I'm making this decision.
Why This Decision Is Harder Than It Looks
The standard framing — "RAG for knowledge, fine-tuning for behavior" — works for textbook cases. It breaks down when your use case has both knowledge and behavioral requirements, which is most of the time.
A few things that make this tricky:
These approaches solve different problems. Prompting, RAG, and fine-tuning aren't rungs on a ladder where you "level up" from one to the next; treating them as sequential upgrades creates brittle systems. They're different architectural choices for different types of problems, each with its own limitations and failure modes.
The question is usually "which combination," not "which one." Most production systems that get this right use some mix — fine-tuning for behavior, RAG for knowledge, prompt engineering for everything in between. The interesting question is where you draw those lines for your use case.
Prompt engineering handles more than people expect. Before reaching for either RAG or fine-tuning, it's worth exhausting what well-crafted prompts can do. Anthropic's prompt engineering documentation argues that prompt engineering is "far more effective than fine-tuning at helping models better understand and utilize external content such as retrieved documents." If good prompts get you where you need to be, the infrastructure investment of RAG or fine-tuning may not be justified.
The decision tree below assumes you've already tried prompt engineering and hit its limits. Five gates, evaluated in order. Each one narrows the path.
One thing to consider first: If your knowledge base is small enough to fit in a model's context window (even partially, with caching), you may not need a retrieval pipeline at all.
Long-context models combined with prompt caching can handle corpora that would have required RAG a year ago. The gates below assume your data exceeds what context engineering alone can manage.
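To make that pre-check concrete, here's a rough sketch of deciding whether a corpus fits the context budget before building retrieval. It assumes a crude ~4-characters-per-token heuristic; swap in your model's actual tokenizer and context window.

```python
# Rough pre-check: does the corpus fit a long-context window at all?
# Token counts are estimated at ~4 chars/token; use a real tokenizer
# for your model in practice.

def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token for English prose)."""
    return len(text) // 4

def fits_in_context(docs: list[str], context_window: int = 200_000,
                    reserve_for_output: int = 8_000) -> bool:
    """True if the whole corpus (plus output headroom) fits the window."""
    total = sum(estimate_tokens(d) for d in docs)
    return total <= context_window - reserve_for_output

docs = ["PTO policy ..." * 100, "API reference ..." * 500]
if fits_in_context(docs):
    print("Corpus fits the window: try context stuffing with caching first")
else:
    print("Corpus exceeds the context budget: a retrieval pipeline is warranted")
```

If the check passes, prompt caching keeps the repeated-context cost manageable and you can defer the RAG infrastructure entirely.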

The full decision tree. Most paths exit toward RAG. Fine-tuning requires passing multiple gates.
How to Read This Tree
Five gates, evaluated in order. Each gate has paths that either continue deeper into the tree or exit to a specific recommendation.
Every terminal node includes a "what this means in practice" note — what you'll actually need to build if you land there.
Three highlighted paths to watch for:
RAG (most common) — the majority of production use cases exit here
Fine-tune (narrowest path) — requires specific conditions at every gate
Hybrid (the realistic middle ground) — fine-tune for behavior, RAG for knowledge
This tree is a snapshot. Revisit it when your data improves, your team grows, or your budget changes. The answer that's right today may shift in six months.
To make this concrete, I'll use a running example: an internal knowledge assistant that answers employee questions about company policies, product specs, and engineering practices. This use case deliberately sits in a gray area — some knowledge changes quarterly, some is stable for years, and the company wants responses in a consistent voice.
Gate 1: Is Your Knowledge Static or Frequently Changing?
How often does the information the model needs to work with actually change?
This is the first gate because knowledge dynamics tends to be the strongest signal. If your information changes weekly or monthly, you generally don't want those facts encoded in model weights.
Retraining takes hours to days and costs real money each time. Updating a retrieval index is usually faster — often minutes to hours depending on corpus size and pipeline maturity.
CHANGING → Lean Toward RAG
If product docs update with every release, if pricing changes seasonally, if policies get revised quarterly — RAG handles this naturally.

Dynamic knowledge needs dynamic retrieval.
For teams with fast-changing documentation — API references, product specs, pricing pages — RAG is the natural fit. You update the index when the docs change, and the model immediately retrieves the current version. No retraining, no waiting. The model's job is to reference current details, not "deeply know" them.
💡 What this means in practice: You'll need an embedding pipeline, a vector store (Pinecone, Weaviate, pgvector, or similar), and a retrieval layer. Budget for ongoing index maintenance and freshness monitoring. RAG at scale is a platform discipline — Barnett et al. identified seven distinct failure points in production RAG systems, from missed documents to incomplete extraction.
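As a toy illustration of those three pieces, here's an in-memory sketch of embed, index, and retrieve. The bag-of-words "embedding" is purely illustrative; a real system uses an embedding model and one of the vector stores named above.

```python
# Minimal in-memory sketch of a RAG retrieval layer: embed, index, retrieve.
# The bag-of-words "embedding" is a stand-in for a real embedding model,
# and the list is a stand-in for a vector store.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts, punctuation stripped."""
    return Counter(w.strip(".,:?!") for w in text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TinyIndex:
    def __init__(self):
        self.docs: list[tuple[str, Counter]] = []

    def add(self, doc: str) -> None:
        self.docs.append((doc, embed(doc)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]

index = TinyIndex()
index.add("Refund policy: refunds are issued within 30 days of purchase.")
index.add("PTO policy: employees accrue 1.5 days of PTO per month.")
index.add("Auth API: POST /v1/auth/token returns a bearer token.")

# Retrieved chunks get stitched into the prompt; re-index when docs change.
print(index.retrieve("what is the refund policy?", k=1))
```

The key property to notice: updating knowledge means calling `add` (or re-indexing), not retraining anything.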
STATIC → Continue to Gate 2
If the knowledge is stable for months or years — medical guidelines, legal frameworks, domain fundamentals — fine-tuning stays on the table. But "static" alone doesn't mean "fine-tune." There are more gates to pass.
Running example
Company policies change roughly quarterly. Product specs change with releases every 2-4 weeks. Engineering practices are relatively stable. The dynamic parts clearly point to RAG. But the stable parts — domain terminology, engineering conventions, company voice — could go either way. Continue to Gate 2.
Gate 2: Does the Model Need to "Know" This Deeply, or Just Reference It?
This is the subtlest distinction in the tree. Does the model need to have internalized the knowledge — understanding patterns, relationships, and domain logic — or does it just need to look something up and present it?
"Reference" means the model retrieves context and uses it to answer. "Deep knowledge" means the model has absorbed patterns into its weights — understanding a query language's syntax, reasoning about domain-specific relationships, or knowing how to structure outputs in ways that go beyond what prompt examples can demonstrate.
REFERENCE → RAG
Most enterprise use cases are reference tasks. "What does our refund policy say?" "What's the API endpoint for user authentication?" "When was the last security audit?"
RAG handles these well. The answer exists in a document somewhere. The model's job is to find it, synthesize it, and present it clearly.
Notion's approach is a good example. As documented in a ZenML case study, Notion built extensive data infrastructure — CDC pipelines, Kafka, vector embeddings — to support RAG-based AI features across their workspace. Their AI needs to reference user content with proper permissions, not internalize it. Freshness and access control were the driving requirements.
⚠️ If you're building for internal use: Retrieval must be permission-aware. Your RAG pipeline needs to enforce the same access controls as the source systems — otherwise the model becomes a backdoor to documents users shouldn't see. ACL filtering, row-level security, and leakage testing aren't optional.
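A minimal sketch of that filtering, assuming a simple group-based ACL model on each chunk. Your source systems' actual permission model will differ; the point is that filtering happens before anything reaches the prompt.

```python
# Permission-aware retrieval sketch: filter candidate chunks by the
# requesting user's groups BEFORE they reach the LLM. The group-based
# ACL model here is illustrative; mirror your source systems' model.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    allowed_groups: frozenset[str]

@dataclass
class User:
    name: str
    groups: frozenset[str]

def retrieve_with_acl(candidates: list[Chunk], user: User) -> list[str]:
    """Drop any chunk the user's groups can't see; the model never gets it."""
    return [c.text for c in candidates if c.allowed_groups & user.groups]

chunks = [
    Chunk("Q3 compensation bands (HR only)", frozenset({"hr"})),
    Chunk("PTO accrual policy", frozenset({"hr", "everyone"})),
]
alice = User("alice", frozenset({"everyone", "engineering"}))
print(retrieve_with_acl(chunks, alice))  # HR-only chunk is filtered out
```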
💡 What this means in practice: Same infrastructure as the Gate 1 RAG exit — embedding pipeline, vector store, retrieval layer. The key investment here is retrieval quality. As Barnett et al. found, many of the seven failure points they identified in production RAG systems sit in the retrieval and indexing layers — missing content, missed top-ranked documents, incomplete extraction — rather than in the LLM itself.
DEEP KNOWLEDGE → Continue to Gate 3
Sometimes the model needs to learn patterns that can't be easily retrieved. The clearest signal: you find yourself cramming more and more examples and rules into prompts, and they're getting unwieldy. The prompt is becoming a manual, and the model still doesn't quite get it.
There are also structural reasons to prefer internalized knowledge: latency constraints that rule out a retrieval round-trip, privacy requirements where sensitive data can't sit in a retrievable index, or consistent classification/transformation tasks where the model needs to map inputs to a domain-specific ontology reliably.
Honeycomb's natural language query assistant is a well-known example. As Hamel Husain documents, Honeycomb fine-tuned rather than using RAG because the model needed to internalize their query language's syntax and rules — a behavioral pattern that couldn't be reliably prompted or retrieved.
Running example
For policy questions — "What's our PTO policy?" — the model is referencing documents. That's RAG. For answering in the company's specific technical voice, using internal terminology correctly, and structuring responses in a particular way — that runs deeper. Continue to Gate 3 for those behavioral requirements.
Gate 3: Can You Create High-Quality Training Data?
Do you have — or can you create — enough high-quality input/output pairs to fine-tune effectively?
This is a hard gate. Fine-tuning without good training data produces results worse than not fine-tuning at all. In practice, training data availability is the most common blocker for fine-tuning projects.
"Quality" here means: accurate labels, diverse examples covering edge cases (not just happy paths), and data formatted the way you want the model's output to look. As Hamel Husain's fine-tuning course emphasizes, you also need a domain-specific evaluation harness — without it, you can't measure whether fine-tuning actually helped.
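A minimal sketch of such a harness, where `model` is any callable (a baseline prompt, a RAG chain, or a fine-tuned endpoint) and the keyword-based grader is a placeholder for real domain-specific graders:

```python
# Minimal eval harness sketch: run labeled cases through any model callable
# and report a pass rate. The keyword grader is a placeholder; real harnesses
# use domain-specific graders (exact match, rubric scoring, LLM-as-judge).

def run_evals(model, cases: list[dict]) -> float:
    """Return the pass rate of `model` over labeled eval cases."""
    passed = 0
    for case in cases:
        output = model(case["input"]).lower()
        if all(kw in output for kw in case["must_contain"]):
            passed += 1
    return passed / len(cases)

cases = [
    {"input": "What is our PTO policy?", "must_contain": ["1.5 days", "month"]},
    {"input": "How do refunds work?", "must_contain": ["30 days"]},
]

def baseline(q: str) -> str:
    """Stand-in for a real model call."""
    if "PTO" in q:
        return "Employees accrue 1.5 days of PTO per month."
    return "Refunds are issued within 30 days."

print(f"pass rate: {run_evals(baseline, cases):.0%}")
```

The same harness measures your prompt-only baseline today and the fine-tuned model later, which is exactly why it's worth building before Gate 3 is even settled.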
NO → RAG (or Prompt Engineering)
If you can't produce quality training data, fine-tuning is off the table regardless of what the other gates say. This is a firm blocker.
That's a fine outcome. RAG with well-engineered prompts handles the majority of production use cases. As Hamel Husain puts it, "techniques like RAG work best to supply the model with context or up-to-date facts."
💡 What this means in practice: Invest in prompt engineering and RAG. Build the evaluation harness now anyway — it helps you measure whether your current approach is working, and it positions you to revisit fine-tuning later if your data situation improves.
YES → Continue to Gate 4
You have quality data — labeled examples, input/output pairs, or a reliable way to generate them at sufficient volume and diversity.
Running example
For the company voice requirement — do we have 500+ examples of "good" responses in our voice? If the team has been writing help articles for years in a consistent style, that's usable training data. If not, we're back to RAG + prompt engineering with style guidelines in the system prompt. For most teams, the honest answer is "we don't have this data yet."
Gate 4: Is the Task Format-Specific?
Does the model need to produce output in a specific, consistent format — structured data, domain-specific syntax, a particular UI rendering format, or a specialized tone that few-shot prompting can't reliably achieve? (When I say "fine-tuning" in this post, I primarily mean supervised fine-tuning — training on input/output pairs. Preference tuning and reinforcement learning from human feedback are related but different techniques with different cost/complexity profiles.)
Format specificity is a common reason teams consider fine-tuning — but it's worth exhausting alternatives first. Constrained decoding (structured output modes, JSON schemas), strict validators with retry loops, and tool-calling patterns can enforce format without training. For simple structures, these are often enough.
Fine-tuning becomes attractive when you still see unacceptable format drift under realistic input diversity — complex or idiosyncratic formats where prompts are brittle, the model drifts with varied inputs, and validation/retry costs add up. RAG retrieves content but gives you limited control over how the model presents it. Fine-tuning bakes format consistency into the model's behavior.
YES → Fine-Tuning Likely Worth It

Fine-tuning requires passing every gate. Format consistency is where it shows the clearest advantage.
The cases where this path makes sense:
Domain-specific output syntax — Honeycomb's query language, where the model needs to generate valid queries in a proprietary format
Structured UI rendering — ReChat's real estate assistant, where outputs mix structured and unstructured data for dynamic UI elements
Consistent structured outputs — JSON schemas, specific table formats, or API response structures where occasional format drift breaks downstream systems
Specialized reasoning patterns — In medical and legal domains, fine-tuned models have shown meaningful accuracy improvements over prompt-engineered general models on specialized tasks, particularly where domain-specific reasoning patterns matter more than general knowledge
⚠️ Risk to account for: Fine-tuning can cause catastrophic forgetting — the model may lose general capabilities when trained on narrow data. This is a real operational risk. Monitor general-purpose performance alongside your domain metrics after fine-tuning.
💡 What this means in practice: You'll need curated training examples (hundreds to thousands), an evaluation set, and a training/retraining pipeline. Budget for periodic retraining as requirements evolve.
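For reference, here's a sketch of preparing those examples in the chat-style JSONL format used by common supervised fine-tuning APIs (one `{"messages": [...]}` object per line; field names follow OpenAI's documented format, so check your provider's docs):

```python
# Sketch of preparing supervised fine-tuning data as chat-style JSONL.
# One {"messages": [...]} record per line; field names follow OpenAI's
# fine-tuning format — verify against your provider's documentation.
import json

SYSTEM = "You answer in the company voice: concise, technical, no fluff."

def to_jsonl_line(question: str, ideal_answer: str) -> str:
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
            {"role": "assistant", "content": ideal_answer},
        ]
    }
    return json.dumps(record)

# Illustrative pairs; real training sets need hundreds to thousands,
# covering edge cases, not just happy paths.
pairs = [
    ("What's our PTO policy?", "1.5 days accrue per month. Details: hr/pto.md."),
    ("How do I rotate an API key?", "POST /v1/keys/rotate. Old key stays valid 24h."),
]
lines = [to_jsonl_line(q, a) for q, a in pairs]
print(f"{len(lines)} training examples prepared")
```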
NO → Continue to Gate 5
The output format is flexible — natural language responses, general Q&A, conversational answers. In this case, the advantage of fine-tuning is less clear. You'd be fine-tuning for knowledge or general behavior, and RAG combined with good prompts typically handles that well.
Running example
The "company voice" requirement is real but moderate. It's a tone preference — consistent terminology, a certain level of technical depth — not a rigid output format. A well-crafted system prompt with style guidelines and a few examples can handle this. NO — continue to Gate 5.
Gate 5: Do You Have the Budget and Infra for Fine-Tuning?
Even if fine-tuning would technically help, can you afford it — the upfront cost and the ongoing maintenance?
Fine-tuning isn't just a training run. It's a commitment to maintaining training data, evaluation sets, and a retraining pipeline. When requirements change or data drifts, you retrain. That's an operational burden on top of the initial investment.
NO → RAG + Good Prompts
The compute costs have dropped significantly. API-based fine-tuning through OpenAI runs a few dollars per million training tokens depending on the model. Open-source alternatives (Llama with LoRA adapters) have brought per-run costs down further.
But compute is the cheap part. The real cost is engineering time: curating training data, building evaluation pipelines, debugging degraded performance, and maintaining the retraining cycle. ROI depends heavily on query volume, per-query savings, and how much engineering time the pipeline requires — there's no universal timeline.
One emerging middle ground: prompt caching (available from Anthropic and OpenAI) can reduce the cost and latency pressure that pushes teams toward fine-tuning. If your main motivation for fine-tuning is baking a long system prompt into the model's weights, cached prompts may get you similar economics without the training overhead.
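As a sketch of what that looks like with Anthropic's API, the request below marks the long, stable system prompt as cacheable via a `cache_control` block. The model id is a placeholder and the payload shape should be verified against current prompt-caching documentation before relying on it.

```python
# Prompt-caching sketch: mark the long, stable system prompt as cacheable
# so repeated requests reuse it instead of paying full price each time.
# Payload shape follows Anthropic's prompt-caching API at time of writing;
# the model id is a placeholder — verify both against current docs.

LONG_STYLE_GUIDE = "Company voice guidelines ... " * 200  # the part you'd otherwise bake into weights

def build_request(user_question: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_STYLE_GUIDE,
                "cache_control": {"type": "ephemeral"},  # cached across requests
            }
        ],
        "messages": [{"role": "user", "content": user_question}],
    }

req = build_request("What's our refund policy?")
print("cacheable system tokens, roughly:", len(req["system"][0]["text"]) // 4)
```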
Whether fine-tuning pays off depends on three things: per-request cost savings, quality uplift you can measure, and ongoing engineering labor for the pipeline. High-value workflows (medical diagnosis, legal analysis) can justify fine-tuning at surprisingly low volume. General-purpose chatbots need high volume to offset the maintenance cost. But for most teams, RAG + well-engineered prompts is the more practical path and handles the majority of production use cases.
💡 What this means in practice: Start with RAG and solid prompt engineering. This is where most teams land, and it's a strong foundation. You can always add fine-tuning later when volume justifies it and you've accumulated training data from production usage.
Running example
For our knowledge assistant: RAG for the dynamic knowledge (policies, product specs) combined with a well-crafted system prompt with style guidelines for the company voice. The behavioral requirements — consistent terminology, appropriate technical depth — are real but moderate enough for prompt engineering to handle. If the voice requirement grows more specific over time and the team accumulates good training data from production usage, fine-tuning becomes worth revisiting. For now, RAG + prompts.
YES → Fine-Tune, but Keep RAG as Complement

The hybrid path: fine-tune for behavior, RAG for knowledge. More complex, but the gains are real.
You've passed all gates: static knowledge needs, deep knowledge requirements, quality training data, and budget to sustain it. Even here, the recommendation is usually hybrid — fine-tune for behavior and format, use RAG for knowledge retrieval.
The evidence for this combination is strong. A study on agricultural domain tasks found that fine-tuning alone improved accuracy by about 6 percentage points, RAG alone improved it by about 5 percentage points, and the gains were additive when combined — not redundant. Geographic knowledge tasks saw answer similarity jump from 47% to 72% with the hybrid approach. These numbers are from one domain; gains will vary depending on task type, data quality, and model choice — but the finding that improvements are additive rather than redundant is the key insight.
💡 What this means in practice: You'll maintain both an embedding/retrieval pipeline and a training pipeline. More operational complexity, but measurable performance gains. Fine-tune for the behavioral layer (tone, format, domain syntax), and use RAG for the knowledge layer (facts, policies, documentation).
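The serving path for that split can be sketched as follows, with `retrieve` and `finetuned_model` as stand-ins for your retrieval layer and tuned endpoint:

```python
# Hybrid serving sketch: RAG supplies current facts, the fine-tuned model
# supplies voice and format. `retrieve` and `finetuned_model` are stand-ins
# for a real retrieval layer and a real tuned endpoint.

def answer(question: str, retrieve, finetuned_model) -> str:
    """Knowledge comes from the index; behavior comes from the weights."""
    context = "\n".join(retrieve(question, k=3))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return finetuned_model(prompt)  # tuned for tone/format, not for facts

# Stand-ins to show the flow end to end:
def retrieve(q: str, k: int = 3) -> list[str]:
    return ["Refund policy: refunds are issued within 30 days of purchase."]

def finetuned_model(prompt: str) -> str:
    return "Per policy: refunds within 30 days. See finance/refunds.md."

print(answer("How do refunds work?", retrieve, finetuned_model))
```

Note the division of labor: updating a policy document only touches the index, while a drifting voice or format only triggers retraining. Neither pipeline has to carry the other's job.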
The Path Most Teams Actually Walk
Looking at this tree, the most common exit is at Gate 1 or Gate 2 — straight to RAG. This is because the majority of production AI features are knowledge-reference tasks with information that changes over time.
The hybrid path (fine-tune + RAG) is the second most common, typically for teams with specialized domains where the model needs both domain behavior and current knowledge.
Pure fine-tuning with no RAG component is the rarest path. It requires static knowledge, deep internalization needs, quality training data, format-specific output, and budget — conditions that most teams don't have all at once.
A few patterns worth noting:
Prompt engineering tends to be underutilized. Before building either pipeline, exhaust what good prompts can do. Invest in prompt engineering, system prompts, and few-shot examples.
The hybrid approach has real evidence behind it. Beyond the agriculture study, an ACM RecSys paper documented live A/B experiments on a billion-user recommendation platform where a hybrid strategy — periodic fine-tuning combined with agile RAG updates — yielded statistically significant improvements in user satisfaction compared to either approach alone.
Eugene Yan offers good advice on where to start: "Start by collecting a set of task-specific evals" before choosing between approaches. Let your evaluation results guide the decision, not intuition about which approach sounds better.
What Each Path Means You'll Build
| | RAG | Fine-Tuning | Hybrid |
|---|---|---|---|
| Infrastructure | Embedding pipeline, vector store, retrieval layer, chunking strategy | Training data pipeline, eval harness, fine-tuning compute, model versioning | Both of the above, plus integration layer |
| Ongoing maintenance | Index updates, freshness monitoring, retrieval quality tuning | Training data curation, retraining triggers, forgetting detection | Both maintenance burdens |
| Update speed | Minutes to hours (re-index) | Hours to days (retrain) | Minutes to hours for knowledge, hours to days for behavior |
| Upfront cost | Lower | Higher | Highest |
| Per-inference cost | Higher (retrieval step) | Lower (no retrieval) | Mixed |
| Typical monthly cost | Varies widely by scale — dominated by embedding compute, vector storage, and retrieval volume | Compute is cheap per training run; engineering time for data curation and evals is the real cost | Highest total cost, highest performance ceiling |
| Best for | Dynamic knowledge, reference tasks, frequently changing docs | Static behavioral patterns, format consistency, domain-specific syntax | Both knowledge and behavioral requirements |
Wrapping Up
Five gates. Each one narrows the decision based on concrete criteria — knowledge dynamics, depth of understanding needed, training data availability, format requirements, and budget realities.
The tree's value is in surfacing blockers early. No training data? Fine-tuning isn't an option. Knowledge changes weekly? Don't spend time debating fine-tuning. Budget tight? RAG + good prompts handles most cases and positions you to add fine-tuning later.
Most teams land on RAG. Teams with specialized domains tend toward hybrid. Pure fine-tuning is rare but powerful when conditions align — static knowledge, deep behavioral needs, quality data, format-specific outputs, and the budget to sustain it.
Next time someone asks "should we fine-tune or use RAG?" — walk the tree. The answer tends to be clearer than the debate suggests.
Next in the series: Build vs Buy tackles the other big architectural question — when to use vendor APIs versus building your own AI capabilities.
Further Reading
Anthropic Prompt Engineering Overview — Start here before reaching for RAG or fine-tuning
Is Fine-Tuning Still Valuable? — Hamel Husain on when fine-tuning actually helps
Seven Failure Points in RAG Systems — What breaks in production RAG (Barnett et al.)
RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study — The agriculture study showing additive hybrid gains
Practical LLM Patterns — Eugene Yan's comprehensive guide to production patterns