Why I Think About Data Before Models

Explores the four quadrants of data readiness - source, quality, lineage, privacy - that determine what's possible with AI

In the previous post, we established three questions to ask before building any AI feature: Should we build this? How will it fail? Can we afford it?

The second question — How will it fail? — is where data enters the picture. In practice, RAG failures frequently trace to data preparation decisions — like how documents are chunked — rather than model choice or retrieval algorithms. I'm using "data" broadly here: content, metadata, index state, and the pipelines that keep them consistent.

The way I think about it: your data quality is your ceiling. I'd rather understand the ceiling before I start optimizing the room.

This post introduces a mental model for evaluating data readiness across four quadrants — Source, Quality, Lineage, and Privacy — before committing to a model approach. It applies to any AI feature that ingests, transforms, or retrieves data: RAG systems, classification pipelines, recommendation engines, AI-assisted search. The model provider and architecture matter, but they operate within the ceiling your data sets.

To be clear, this isn't "never touch a model until data is perfect." Perfect data doesn't exist. It's more like: understand what your data can and can't support, so you're not spending weeks debugging model behavior that's actually a data problem upstream.

The Four Quadrants of Data Readiness

I evaluate data readiness across four areas. Each one can independently block production readiness — a problem in any single quadrant is sufficient to degrade the entire feature.

[Image: The data readiness layer showing four quadrants: Source, Quality, Lineage, and Privacy]

The four quadrants of data readiness. Each one gates a different class of production failure.

Here's the quick version of what each quadrant asks:

| Quadrant | Core Question |
|----------|---------------|
| Source | Where does this come from, and will it still be there tomorrow? |
| Quality | Is it clean enough for this use case? |
| Lineage | Can I trace how this answer was assembled? |
| Privacy | Would I be comfortable if this data handling were public? |

They interact — a source instability problem often surfaces as a quality problem, and privacy constraints affect what data you can store, which directly shapes what lineage is possible. But I evaluate them separately because the mitigations are different.

Source: Where It Comes From, How Fresh, How Stable

Three questions to start with: Where does the data originate? How often is it updated? What happens when the source changes schema or goes down? Schema evolution is the quiet one — a source system adds a field, renames a column, or changes an enum value, and the downstream pipeline either breaks loudly or produces subtly wrong results. Versioned schemas and forward-compatibility strategies are worth establishing before they're needed.
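A lightweight version of that forward-compatibility strategy is to check an explicit schema version at ingestion and quarantine anything the pipeline doesn't recognize, rather than silently mis-parsing it. A minimal sketch; the field name and version labels are illustrative, not a standard:

```python
# Reject or flag source records whose schema version the pipeline
# doesn't know, instead of letting them break things downstream.
SUPPORTED_SCHEMA_VERSIONS = {"v1", "v2"}  # illustrative

def check_record(record: dict) -> str:
    version = record.get("schema_version")
    if version in SUPPORTED_SCHEMA_VERSIONS:
        return "ok"
    # Unknown version: hold the record for review rather than guessing.
    return f"quarantine: unknown schema_version {version!r}"

print(check_record({"schema_version": "v2", "title": "Refunds"}))
print(check_record({"schema_version": "v3", "title": "Shipping"}))
```

The point isn't the check itself but that it fails loudly, which is the behavior the paragraph above argues for.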

Freshness vs. staleness

Every AI feature has a staleness budget — how out-of-date the data can be before outputs become misleading. The tricky part is that tightening the budget gets expensive fast. Moving from a 24-hour data freshness SLA to sub-minute freshness increases operational complexity substantially — streaming pipelines, backfill handling, deduplication, and monitoring all come into play. That trade-off tends to make sense only when staleness becomes a measurable business problem — support tickets about outdated pricing, compliance gaps, users making decisions on stale information.
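The staleness budget is worth making explicit in code rather than leaving it implicit in a cron schedule. A minimal sketch, assuming a 24-hour budget (the number is illustrative):

```python
from datetime import datetime, timedelta, timezone

# Illustrative budget: how out-of-date data may be before outputs mislead.
STALENESS_BUDGET = timedelta(hours=24)

def is_stale(last_synced: datetime, now: datetime) -> bool:
    """True if the data has exceeded its staleness budget."""
    return now - last_synced > STALENESS_BUDGET

now = datetime(2024, 6, 2, 12, 0, tzinfo=timezone.utc)
print(is_stale(datetime(2024, 6, 1, 6, 0, tzinfo=timezone.utc), now))  # True: 30h old
print(is_stale(datetime(2024, 6, 2, 2, 0, tzinfo=timezone.utc), now))  # False: 10h old
```

An alerting job that evaluates this predicate per source is often enough to turn "users noticed stale data" into "the pipeline noticed first."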

For a lot of internal tools, a nightly sync is fine. For customer-facing product search, it probably isn't. The right answer depends on the cost of a stale result in your specific system.

The orphaned vector problem

Here's a failure mode that's easy to miss: when documents are deprecated, batch pipelines may remove them from the source system but leave orphaned vectors in the index. For days, users can get search results pointing to content that no longer exists. It requires explicit deletion tracking and cleanup jobs — something that tends to get built reactively, after the first user reports a ghost result. The two main strategies are tombstones with filter-at-query-time (soft delete) and hard delete with index rebuild. Soft deletion is operationally simpler but accumulates state — and filters can drift or be missed when multiple indexes or tenants share the same infrastructure.
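The soft-delete strategy can be sketched as a tombstone set consulted at query time. Every name here is illustrative, not a real vector-store API:

```python
# Soft deletion via tombstones: stale vectors stay in the index, and the
# query path filters them out until a cleanup job rebuilds the index.
tombstones: set[str] = set()  # doc IDs marked deleted

def deprecate(doc_id: str) -> None:
    """Mark a document deleted without rebuilding the index."""
    tombstones.add(doc_id)

def filter_results(results: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Drop hits whose source document has been tombstoned."""
    return [(doc_id, score) for doc_id, score in results if doc_id not in tombstones]

# doc-2 was deprecated, but its vector is still sitting in the index.
deprecate("doc-2")
hits = [("doc-1", 0.91), ("doc-2", 0.88), ("doc-3", 0.75)]
print(filter_results(hits))  # doc-2 never reaches the user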

Cost signal

At scale, full re-embedding pipelines can become a significant recurring cost — some organizations reportedly spend thousands in GPU time per refresh cycle when processing large document collections. Selective re-embedding — only processing changed chunks rather than entire documents — can cut refresh cost and time-to-freshness dramatically, but it requires knowing what changed, which circles back to lineage.
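Selective re-embedding reduces to a content-hash diff: re-embed a chunk only when its hash no longer matches the one recorded at the last refresh. A minimal sketch with hypothetical chunk IDs:

```python
import hashlib

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_reembed(old_hashes: dict[str, str], new_chunks: dict[str, str]) -> list[str]:
    """Return chunk IDs whose content changed since the last refresh, or that are new."""
    return [cid for cid, text in new_chunks.items()
            if old_hashes.get(cid) != chunk_hash(text)]

old_hashes = {"c1": chunk_hash("pricing is $10"), "c2": chunk_hash("refunds in 30 days")}
new = {"c1": "pricing is $12", "c2": "refunds in 30 days", "c3": "new shipping policy"}
print(chunks_to_reembed(old_hashes, new))  # ["c1", "c3"]: c2 is unchanged, skip it
```

Note the dependency the paragraph mentions: this only works if the previous hashes were recorded, which is lineage data.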

Detection signals I'd watch for

  • Users reporting outdated information

  • Retrieval results referencing deprecated or deleted documents

  • Embedding freshness lag exceeding the agreed-upon SLA

  • Source pipeline failures that don't propagate alerts to downstream consumers

If I were responsible for this system, the first thing I'd want to know is: what's the source-of-truth pipeline, and what's my staleness budget?

Quality: Clean Enough for This Use Case

"Good enough" depends on the stakes

A customer-facing FAQ chatbot has different quality requirements than an internal code search tool. Universal thresholds rarely hold across contexts. A well-established principle in data quality work is "fitness for use" — evaluating quality relative to what decisions the data feeds, rather than applying a blanket standard. Survey literature on dataset quality in ML reinforces that quality requirements are stage-specific and context-dependent.

Six dimensions are commonly used to evaluate data quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity. Not all of them matter equally for every use case. For a recommendation engine, completeness might be the critical dimension. For a medical triage assistant, accuracy dominates everything else.
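Two of those dimensions, completeness and uniqueness, are cheap to compute over a record batch. A toy sketch, with illustrative field names; which scores actually gate a release is the fitness-for-use decision, not a universal bar:

```python
def completeness(records: list[dict], required: list[str]) -> float:
    """Fraction of records with all required fields present and non-empty."""
    ok = sum(all(r.get(f) not in (None, "") for f in required) for r in records)
    return ok / len(records)

def uniqueness(records: list[dict], key: str) -> float:
    """Fraction of records carrying a distinct value for the key field."""
    values = [r[key] for r in records]
    return len(set(values)) / len(values)

records = [
    {"id": 1, "title": "Refund policy", "body": "Refunds within 30 days."},
    {"id": 2, "title": "", "body": "Orphaned article."},      # incomplete
    {"id": 1, "title": "Shipping", "body": "5 business days."},  # duplicate id
]
print(completeness(records, ["title", "body"]))  # 2/3: one record missing a title
print(uniqueness(records, "id"))                 # 2/3: one duplicated id
```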

Bias deserves special mention here. Bias in training data composition, labeling, or demographic representation is a quality dimension that's particularly hard to detect and measure — partly because it often looks like "correct" data until you examine outputs across different user segments.

The quality cliff

One study on tabular ML performance found that algorithms tolerated degrading data quality until a threshold around 0.8, where performance dropped steeply. The inflection point varies by algorithm type and task, but the general pattern — stability followed by a cliff — is worth watching for.

The concerning part: the cliff can be hard to detect in production. If you're not actively measuring output quality, you might not notice you've crossed it until user complaints start arriving.

Worth noting: in GenAI systems, "quality" includes both input quality and the measurement harness. If the eval is weak, you'll misdiagnose data problems as model problems — and vice versa. The quality cliff is only visible if you have an eval capable of detecting it.

What this looks like when it breaks

A few examples that stuck with me:

  • Unity Software: As multiple industry accounts have reported, a single corrupted file in Unity's ad targeting data resulted in a reported $110 million revenue impact. A quality failure, not a model failure.

  • RAG chunking: Basic fixed-size chunking scored significantly lower on faithfulness than semantic chunking — same model, same underlying data, different preparation decisions.
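To make that preparation decision concrete, here is a toy fixed-size chunker, the "basic" strategy in that comparison. Note how it cuts mid-sentence, which is one reason semantically-aware chunking tends to score better on faithfulness:

```python
def fixed_size_chunks(text: str, size: int = 40) -> list[str]:
    """Split text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = "Refunds are issued within 30 days. Shipping takes 5 business days."
for chunk in fixed_size_chunks(doc):
    print(repr(chunk))
# The split lands mid-word ("Shipp..."), so neither chunk carries the
# full shipping statement: same data, worse retrieval material.
```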

The trust trap

A common practitioner observation is that a bad RAG system can be worse than having no RAG at all. Once users lose confidence in a system's accuracy, they stop using it — even after the underlying quality improves. Trust recovery tends to be slower and more expensive than getting quality right upfront.

Detection signals I'd watch for

  • Declining user engagement or satisfaction scores

  • Rising hallucination rates in model outputs

  • Quality metrics drifting without corresponding model changes

  • Users building workarounds instead of using the AI feature

I wouldn't want to ship without understanding what "clean enough" means for this specific use case. That's a decision I'd want to make explicitly — not discover through user complaints.

Lineage: Can You Trace the Path?

Can you trace from raw input → transformed data → model input → model output? Can you version it? Can you explain why a specific answer was generated?

Lineage in the LLM era

In traditional ML, data lineage meant tracking datasets through transformation pipelines. For LLM-powered features, the scope is broader. "Data" now includes prompts, retrieved context, conversation history, and embeddings. End-to-end lineage for GenAI needs to trace from source systems → vector databases → LLMs → user-facing outputs.

Concretely, minimum viable lineage for an LLM feature means logging: source document IDs and versions, chunk IDs with offsets and hashes, embedding model version, index build ID, retrieval results (top-k with scores), prompt template version and final assembled prompt, model version and decoding parameters, and the response with any post-processing steps. Not all of these need to be stored permanently, but being able to reconstruct the path from query to answer is the bar.
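That logging list can be captured as one record per answered query. A sketch with assumed field names, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    """Illustrative minimum-viable lineage for one answered query."""
    query_id: str
    source_doc_versions: dict[str, str]   # doc ID -> version at retrieval time
    embedding_model: str
    index_build_id: str
    retrieval: list[tuple[str, float]]    # top-k (chunk ID, score)
    prompt_template_version: str
    model_version: str
    decoding_params: dict = field(default_factory=dict)

rec = LineageRecord(
    query_id="q-123",
    source_doc_versions={"policy.md": "v7"},
    embedding_model="embed-v2",
    index_build_id="idx-2024-06-01",
    retrieval=[("policy.md#c2", 0.91)],
    prompt_template_version="tmpl-v3",
    model_version="llm-v5",
    decoding_params={"temperature": 0.2},
)
# "Why did the system say that?" starts here: which doc version,
# which index build, which prompt template.
print(rec.index_build_id)
```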

This matters for debugging. When an AI feature produces a wrong answer, the first question is "where did that come from?" Without lineage, you're guessing. With it, you can isolate whether the problem was the source data, the retrieval step, the prompt construction, or the model itself. That distinction changes the fix entirely.

Consider a scenario: a RAG system starts surfacing outdated policy answers after a data migration. The team spends days tuning prompts and adjusting retrieval parameters — debugging the model — before discovering the retrieval index was still pulling from a deprecated internal wiki that should have been decommissioned. With lineage, tracing the answer back to its source would have revealed this in minutes.

Rollback capability

Data versioning enables rollback when new data causes regressions. Without it, a bad data update is a one-way door. You can't return to the state that was working, and you may not even know which update caused the degradation.

This connects directly to the source quadrant: if selective re-embedding requires knowing what changed, lineage is what tells you.

The "works on my machine" problem

Without reproducibility, different environments (dev, staging, prod) may produce different outputs from the same query because they're pulling from different data snapshots. This makes debugging production issues significantly harder — the issue doesn't reproduce locally because the data is different.
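One mitigation is to pin every environment to an explicit data snapshot ID and compare those IDs first when outputs diverge. An illustrative sketch; the snapshot registry and naming are assumptions:

```python
# Each environment records which data snapshot its index was built from.
snapshots = {
    "dev":     "snap-2024-05-20",
    "staging": "snap-2024-06-01",
    "prod":    "snap-2024-06-01",
}

def diagnose(env_a: str, env_b: str) -> str:
    """First question when the same query answers differently across envs."""
    a, b = snapshots[env_a], snapshots[env_b]
    if a != b:
        return f"{env_a} ({a}) and {env_b} ({b}) are on different snapshots"
    return f"{env_a} and {env_b} share snapshot {a}; look elsewhere"

print(diagnose("dev", "prod"))      # snapshot mismatch: data, not model
print(diagnose("staging", "prod"))  # same snapshot: keep debugging
```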

Detection signals I'd watch for

  • Inability to explain why a specific output was generated

  • No rollback capability when data updates degrade quality

  • Audit failures due to missing provenance records

  • Different outputs for the same query across environments

The standard I'd set for myself: if someone asks "why did the system say that?", I should be able to trace the answer back to its sources within minutes, not days.

Privacy: Would You Be Comfortable If This Were Public?

What PII touches the pipeline? Do users know? Are retention policies enforced? Could a regulatory audit pass today?

Training data memorization

LLMs can reproduce fragments of training data verbatim. Security researchers have found that attackers can prompt models to output training data, and even anonymized datasets remain vulnerable to cross-referencing attacks. Training datasets have been found containing nearly 12,000 live API credentials. These are concrete risks, not theoretical ones.

Soft leaks

The harder-to-catch variant: models paraphrasing sensitive information from prior interactions — client names, internal reports, proprietary phrasing — without triggering keyword-based PII scans. As one analysis described, these "soft leaks" don't trip obvious alerts because the model isn't reproducing text verbatim. It's rephrasing it.

Shadow AI

One vendor analysis estimated that roughly 1 in 12 employee prompts to unapproved public AI tools contains confidential information. These interactions bypass formal access logs, masking, and audit trails entirely. The data is out there, and there's no record of it leaving.

The GDPR challenge

Data embedded in model weights cannot be easily removed. Erasure requests under GDPR raise difficult technical questions — research on machine unlearning shows that removing specific data influence from trained models remains an open problem. In practice, teams often fall back to retraining or architectural workarounds — like keeping sensitive data in deletable stores (RAG over governed data) rather than baking it into weights.

What the fines look like

The financial consequences are concrete. OpenAI was fined €15M by Italy's data protection authority, and Clearview AI was fined €30.5M by the Dutch DPA for illegal data collection. Wiz Research discovered a publicly accessible DeepSeek database containing chat history, API keys, and backend details. The reported OmniGPT breach allegedly exposed 30,000 user emails and 34 million chat messages. These incidents are becoming more frequent.

Privacy engineering patterns worth knowing

Several patterns can reduce exposure: token-level input filtering to redact secrets before model processing, differential privacy for training, context-based access control that evaluates data appropriateness dynamically, encrypted vector databases, and data minimization — limiting collection to what's necessary for the specified purpose. The right combination depends on your threat model and regulatory requirements.
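Token-level input filtering, the first pattern above, can be sketched with a couple of toy regexes. A real scanner needs far broader coverage (names, addresses, many credential formats); these two patterns are purely illustrative:

```python
import re

# Toy redaction patterns applied before text reaches a model.
PATTERNS = {
    "EMAIL":   re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{8,}\b"),  # assumed key shape
}

def redact(text: str) -> str:
    """Replace matched spans with a label so secrets never enter the prompt."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane@example.com, key sk-abc12345xyz"))
# Contact [EMAIL], key [API_KEY]
```

Note this catches verbatim secrets only; the "soft leak" problem above, where the model paraphrases sensitive content, is exactly what pattern matching cannot see.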

Detection signals I'd watch for

  • No PII scan in the data ingestion pipeline

  • Retention policies not enforced or not defined

  • No audit trail for what data was sent to which model

  • Users able to extract information about other users through prompt manipulation — including prompt injection attacks that cause the system to retrieve and surface documents the user shouldn't have access to. In shared-index architectures, entitlements-aware filtering at retrieval time is the minimum bar; without it, cross-tenant data leakage is an architectural default, not an edge case
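The entitlements-aware filter mentioned in that last signal can be sketched as an ACL check applied before top-k truncation. All structures here are illustrative:

```python
def allowed(user_groups: set[str], acl: set[str]) -> bool:
    """True if the user shares at least one group with the document's ACL."""
    return bool(user_groups & acl)

def retrieve_for_user(hits: list[dict], user_groups: set[str], top_k: int = 3) -> list[dict]:
    """Filter BEFORE truncating to top-k, so entitled results aren't
    crowded out by documents the user couldn't see anyway."""
    visible = [h for h in hits if allowed(user_groups, h["acl"])]
    return visible[:top_k]

hits = [  # ranked results from a shared index, with per-document ACLs
    {"id": "hr-comp-plan", "acl": {"hr"},         "score": 0.93},
    {"id": "eng-handbook", "acl": {"eng", "all"}, "score": 0.88},
    {"id": "public-faq",   "acl": {"all"},        "score": 0.80},
]
print([h["id"] for h in retrieve_for_user(hits, {"eng", "all"})])
# ['eng-handbook', 'public-faq']: the HR document never reaches the prompt
```

Filtering at retrieval time means a prompt-injection attack can at worst surface documents the user was already entitled to read.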

If I were responsible for this system, privacy would be a design constraint, not a compliance checkbox. The question I'd keep asking: would I be comfortable if our data handling were made public?

What I Would Not Do

A few anti-patterns I'd want to avoid:

Skip the data evaluation and go straight to model selection. Spending weeks evaluating models before understanding whether the data can support any of them is a common sequence — and an expensive one. The data ceiling applies regardless of which model you choose.

Set universal quality thresholds. "All data must be 99% accurate" is meaningless without context. An internal search tool and a medical triage assistant have fundamentally different bars, and the threshold should be driven by the cost of a wrong answer in each system.

Treat privacy as a post-launch concern. The cost of retrofitting privacy controls — technically and legally — tends to exceed the cost of designing them in from the start. And the regulatory exposure from the gap period is real.

Build without lineage, even for prototypes. The prototype becomes the production system more often than anyone plans. If lineage isn't built in from the beginning, it typically stays absent — and debugging becomes guesswork indefinitely.

The Cost of Getting This Wrong

Data quality failures tend to be more expensive than model failures for a specific reason: they're harder to detect, and they affect every downstream consumer of that data.

Unity's reported $110 million loss came from a single corrupted file — a quality failure, not a model failure. Systematic evaluation frameworks for RAG deployments remain uncommon, which means many teams aren't measuring whether data issues are degrading output quality.

The silent degradation pattern is what concerns me most: retrieval precision drops, hallucination rates climb, users get worse answers — and by the time anyone notices, the damage to user trust is already done. Unlike model upgrades, which are typically reversible (swap back to the previous version), data quality degradation often compounds over weeks or months before the symptoms become obvious.

To be clear, hallucinations are multi-causal — prompting, guardrails, model decoding, and retrieval configuration all contribute. But data quality issues are among the hardest root causes to isolate, precisely because they look like model problems from the outside.

What Would Change My Mind

Three developments would weaken the "data before models" argument:

Models that self-correct for noisy data. If a model could reliably detect and compensate for data quality issues at inference time — flagging low-confidence answers when source data looks unreliable — the upfront investment in data quality becomes less critical. Some retrieval systems are moving in this direction with relevance scoring, but we're not there yet.

Automated data quality monitoring that's essentially free. If quality monitoring tools advanced to the point where real-time data quality assessment was a low-cost default rather than an explicit investment, the cost argument for upfront evaluation shifts. Current tools require meaningful setup and ongoing attention.

Privacy-preserving inference that eliminates raw data exposure. If confidential computing and differential privacy matured enough that raw PII never needed to enter the pipeline, the privacy quadrant becomes less of a design constraint. This is an active research area, but production-ready implementations are still limited.

Currently, none of these conditions are fully met. So the four-quadrant framework remains the bar I'd set for any AI feature I was responsible for.

A Checklist, Not a Gatekeeper

The four quadrants are meant to be a lightweight checklist — something you can walk through in an hour with a whiteboard — not a heavyweight approval gate. The goal is to understand your data ceiling before you commit to a model approach, not to achieve perfect data before you start building.

Here's the condensed version:

| Quadrant | Key Questions |
|----------|---------------|
| Source | Where does it come from? How fresh? What's the staleness budget? What happens when the pipeline breaks? |
| Quality | What does "clean enough" mean for this use case? Which quality dimensions matter most? Where's the quality cliff? |
| Lineage | Can I trace from input to output? Can I roll back? Can I reproduce results across environments? |
| Privacy | What PII touches the pipeline? Are retention policies enforced? Could this pass an audit today? |

The model can't outperform what you feed it. Starting with data is how you avoid the expensive surprises later.
