Incremental Re-indexing and the Embedding Pipeline Nobody Talks About
Embedding once is easy. Here's the production engineering behind keeping your RAG index correct, fresh, and survivable when you need to change your model or chunking strategy

Embedding once is easy. You call the API, store the vector, move on.
The tutorials stop there. But in production, the embedding pipeline is not a one-time job — it's an ongoing data management system. Documents change. Your embedding model gets deprecated. Your chunking strategy turns out to be wrong. You need to update the index without corrupting it, without taking the system down, and without paying to re-embed content that hasn't changed.
That's the part most tutorials skip.
This post covers three problems that come up in production RAG systems after the prototype is working: embedding model lock-in, idempotent partial re-indexing, and the freshness vs. throughput tradeoff. If you're still on your first deployment, consider this a heads-up. If you're already dealing with these, this is the reasoning I'd work through.
ℹ️ Note: This post is part of the Production RAG series. Post 2 covered vector database selection — your vector DB choice constrains your pipeline design, so that context matters here.
Why the index is harder to maintain than it looks
A vector index stores documents as understood by a specific model at a specific point in time — not as neutral records. That distinction matters when anything changes.
Embedding models don't produce a neutral, universal representation of text. They produce coordinates in a high-dimensional space that's specific to that model's training. Two models trained on the same corpus with different architectures produce incompatible coordinate systems. If you embed half your documents with model A and half with model B, your similarity search is comparing coordinates from two different maps.
This failure mode is quiet. The system doesn't error. It returns results — just the wrong ones, because the query vector (produced by whichever model you're currently calling) lands in a neighborhood that only makes sense in one coordinate system, while the other half of your index is organized by a different geometry.

Same documents, different models, incompatible coordinate spaces. A query against a mixed-model index lands in the right neighborhood for one model's geometry — and the wrong one for the other.
The practical implication: every time you change your embedding model, you must re-embed all content that will be compared against queries using the new model. Within a single index, there's no clean partial migration path — you can maintain two separate indexes in parallel during a transition, but you can't mix model versions within one. Mixed-model indexes produce silently degraded retrieval.
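A toy sketch makes the incompatibility concrete. Here two "models" are simulated as deterministic random projections — pure illustration, nothing like how real embedding models are trained — so that the same document gets a stable vector within one model, while vectors from different models are essentially uncorrelated:

```python
import hashlib
import math
import random

def word_vec(word: str, model: str, dim: int = 64) -> list[float]:
    # Deterministic pseudo-random vector per (model, word) pair — a stand-in
    # for "coordinates specific to that model's training".
    seed = int.from_bytes(hashlib.sha256(f"{model}:{word}".encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(dim)]

def embed(text: str, model: str, dim: int = 64) -> list[float]:
    v = [0.0] * dim
    for w in text.lower().split():
        for i, x in enumerate(word_vec(w, model, dim)):
            v[i] += x
    return v

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

doc = "refund policy for enterprise customers"
same = cosine(embed(doc, "model_a"), embed(doc, "model_a"))   # ≈ 1.0
cross = cosine(embed(doc, "model_a"), embed(doc, "model_b"))  # typically near zero
```

Identical text, identical "model": cosine 1.0. Identical text, different "model": noise. That's the mixed-index failure mode in miniature — the comparison isn't wrong, it's meaningless.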
The three problems, and how they compound
Problem 1: Model lock-in
Embedding model selection looks like a configuration decision. In production, it's a migration commitment.
Every model upgrade triggers a re-index of the entire corpus. The migration window — the period where you're building the new index while the old one is live — is where retrieval quality is at risk. If you get this wrong and both model versions exist in the same index simultaneously, your queries start returning inconsistent results that are hard to debug because everything appears to be working.
This is a data migration problem more than a machine learning problem. The engineering discipline that applies is the same discipline you'd bring to renaming a column in a high-traffic database: plan for the transition state, not just the destination state.
Problem 2: Partial re-indexing without idempotency
Even without a model change, documents in a production system update continuously. You need a pipeline that can run repeatedly — on a schedule, in response to document updates, or on demand — without creating duplicates or inconsistency.
Without idempotency, every pipeline re-run embeds every document again, overwriting existing vectors or creating duplicates depending on how your upsert logic works. Duplicate vectors degrade retrieval precision silently: the same content appears multiple times in result sets, consuming slots that should go to diverse, relevant results.
The fix is content-addressable hashing: compute a hash of each normalized document chunk before deciding whether to embed it. If the hash matches what's stored, skip the embedding. If it's changed or new, embed and update. This converts a re-run from "re-process everything" to "re-process only what changed."
Research into production-grade vector architectures demonstrates that with content-addressable hashing, only 10–15% of content typically needs re-processing between runs, compared to 100% for a full re-index. At scale, that difference is meaningful in both API cost and pipeline runtime.
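A back-of-envelope sketch of what that difference means per run — the token count and per-token price below are illustrative placeholders, not any provider's quote:

```python
def reembed_cost(total_chunks: int, changed_fraction: float,
                 tokens_per_chunk: int = 500, price_per_mtok: float = 0.02) -> float:
    """Estimated embedding spend for one pipeline run, in dollars.

    All numbers are illustrative assumptions — plug in your own corpus
    size, chunk length, and provider pricing.
    """
    chunks = total_chunks * changed_fraction
    return chunks * tokens_per_chunk / 1_000_000 * price_per_mtok

full = reembed_cost(1_000_000, 1.0)          # every run re-embeds everything
incremental = reembed_cost(1_000_000, 0.12)  # ~12% changed, per the figure above
# full → 10.0, incremental → 1.2 (dollars per run, under these assumptions)
```

The dollar amounts are small per run at this pricing; the point is the ratio — roughly 8x — which compounds across every scheduled re-run, and the runtime scales the same way.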
Problem 3: Schema changes compound the migration
Not all changes require a full re-index. Understanding which changes require what response prevents unnecessary work:
| Change type | Re-index required? | Notes |
|---|---|---|
| New metadata field (no content change) | Usually no | Can upsert metadata without re-embedding |
| Chunking strategy change | Yes, full re-index | Chunk boundaries change → new vectors for all affected content |
| Embedding model change | Yes, full re-index | All content must be re-embedded with the new model |
| Document content update | Partial (changed docs only) | Hash check catches this; only modified chunks need re-embedding |
| Model + chunking change simultaneously | Yes, full re-index | No way to separate these — both are coupled to the vectors |
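The table collapses into a small guard function. The change-type names here are my own labels, not from any library — the useful part is the safe default for unrecognized changes:

```python
# Hypothetical mapping from change type to required action; unknown
# changes fall through to a full re-index as the conservative default.
REINDEX_SCOPE = {
    "metadata_only": "upsert_metadata",     # no re-embedding needed
    "content_update": "partial_reindex",    # hash check limits the blast radius
    "chunking_change": "full_reindex",
    "model_change": "full_reindex",
    "model_and_chunking": "full_reindex",
}

def reindex_scope(change_type: str) -> str:
    """Return the cheapest safe response to a given change type."""
    return REINDEX_SCOPE.get(change_type, "full_reindex")
```

Encoding this as data rather than tribal knowledge means a CI check or deploy script can refuse a config change that silently requires a full re-index.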

Decision tree: which change requires what level of re-indexing. Most changes don't require a full re-index — knowing the path prevents unnecessary work.
One additional failure mode worth naming: orphan vectors. When documents are deleted or re-chunked, the old chunk vectors remain in the index unless explicitly removed. Over time, these orphans accumulate and silently degrade retrieval precision. Garbage collection in the index is a real maintenance task.
Building an idempotent embedding pipeline
The content hash check
The core primitive:
```python
import hashlib
import re

def compute_chunk_hash(text: str) -> str:
    """Normalize and hash a chunk for change detection."""
    # Strip, lowercase, and collapse whitespace so minor formatting
    # changes (trailing newlines, extra spaces) don't trigger re-embeds.
    normalized = re.sub(r'\s+', ' ', text.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()

def should_reembed(chunk_id: str, new_hash: str, hash_store: dict) -> bool:
    """Returns True only if the chunk is new or content has changed."""
    return hash_store.get(chunk_id) != new_hash
```
One consideration before you start embedding: most embedding APIs silently truncate inputs that exceed their token limit. If your chunks are long, verify the model's context window and handle truncation explicitly — silent truncation means two chunks with the same first N tokens hash differently but produce the same embedding.
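A minimal guard for this, using a rough characters-per-token heuristic — the limit shown is an example (check your model's documented context window), and real counts should come from your provider's tokenizer:

```python
MAX_INPUT_TOKENS = 8191  # assumption: verify against your model's documented limit

def approx_token_count(text: str, chars_per_token: float = 4.0) -> int:
    # Rough heuristic (~4 chars/token for English prose); use the
    # provider's tokenizer when you need exact counts.
    return max(1, round(len(text) / chars_per_token))

def guard_chunk(text: str) -> str:
    """Fail loudly instead of letting the embedding API truncate silently."""
    estimate = approx_token_count(text)
    if estimate > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Chunk is ~{estimate} tokens, over the model limit; "
            "split it upstream rather than embedding a truncated prefix."
        )
    return text
```

Failing loudly here is the point: a raised error at index time is debuggable; a silently truncated embedding is a retrieval-quality bug you find months later.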
You need somewhere to store these hashes alongside your vectors. In practice, this is usually a Redis cache keyed by chunk ID, or a metadata table in your vector database that stores the hash alongside the vector. Most vector databases support storing arbitrary metadata per record — use that.
The normalization step matters. If you hash raw text including whitespace and case, minor formatting changes (trailing newlines, capitalization fixes) will trigger unnecessary re-embeds. Normalize first: strip, lowercase, and collapse whitespace before hashing — the re.sub(r'\s+', ' ', ...) handles the whitespace collapse that .strip() alone misses.
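To see what normalization buys you, hash two formatting variants of the same chunk (the hash function is restated here so the snippet is self-contained):

```python
import hashlib
import re

def chunk_hash(text: str) -> str:
    # Same normalize-then-hash primitive as above.
    normalized = re.sub(r'\s+', ' ', text.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()

# Formatting-only differences: capitalization, extra whitespace, newlines.
a = chunk_hash("Refund Policy\n\nAll refunds within 30 days.")
b = chunk_hash("  refund policy  all refunds within 30 days. ")

# A real content change: 30 days → 14 days.
c = chunk_hash("Refund Policy\n\nAll refunds within 14 days.")

# a == b (no re-embed), a != c (re-embed)
```

Formatting churn hashes identically and gets skipped; the substantive edit hashes differently and triggers a re-embed — which is exactly the behavior you want from the change detector.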
⚠️ Production note: These examples omit error handling for clarity. In production, wrap embedding API calls with retries and exponential backoff, and handle vector DB write failures explicitly. Most embedding providers have rate limits that surface as transient errors under load.
Batching with backpressure
Processing documents one at a time against an embedding API is expensive per-call and slow. Batch processing (50+ documents per API call depending on your provider's limits) is standard. The less obvious piece is what happens around the batch:
```python
import time

def batch_reindex(documents, hash_store, batch_size=50, rate_limit_delay=0.5):
    """Re-embed only changed documents, in batches, with per-batch commits.

    Note: `documents` should be a generator or lazy iterable for large
    corpora — materializing the full filtered list into memory can be
    expensive at scale.
    """
    to_reembed = (
        doc for doc in documents
        if should_reembed(doc.chunk_id, compute_chunk_hash(doc.text), hash_store)
    )
    batch = []
    for doc in to_reembed:
        batch.append(doc)
        if len(batch) >= batch_size:
            embeddings = embed_batch([d.text for d in batch])
            # Commit per batch, not per job — avoids long-held transactions
            upsert_batch(batch, embeddings)
            update_hash_store(batch, hash_store)
            # Respect API rate limits between batches
            time.sleep(rate_limit_delay)
            batch = []
    # Flush remaining documents
    if batch:
        embeddings = embed_batch([d.text for d in batch])
        upsert_batch(batch, embeddings)
        update_hash_store(batch, hash_store)
```
Per-batch commits matter in production. A single transaction for an entire re-index job means that if the job fails halfway through, you either roll back everything or commit a partial state. Per-batch commits make the pipeline resumable: if it fails on batch 40 of 200, the next run picks up where it left off because batches 1–39 already have updated hashes.
Orphan garbage collection
After each pipeline run, any chunk IDs in your index that are no longer in the source dataset are orphans. The cleanup pattern:
```python
def remove_orphans(index, current_chunk_ids: set, dry_run: bool = False):
    """Delete vectors for chunks no longer in the source dataset.

    Note: `index.list_all_ids()` is pseudocode — the actual API varies by
    vector DB. Qdrant uses scroll, Pinecone uses list/fetch, Weaviate uses
    cursor-based iteration.
    """
    indexed_ids = set(index.list_all_ids())
    orphan_ids = indexed_ids - current_chunk_ids
    if orphan_ids:
        if dry_run:
            print(f"Dry run: would delete {len(orphan_ids)} orphan vectors")
            return
        index.delete(ids=list(orphan_ids))
```
💡 Tip: Run with dry_run=True the first time to verify the orphan list before deletion. A misconfigured current_chunk_ids set could otherwise delete valid vectors.
How often to run this depends on your delete rate. For collections with low delete rates, I'd start with a daily reconciliation sweep and adjust based on observed orphan accumulation. For systems with high delete rates, run it as part of every indexing job.
Upgrading your embedding model without downtime
The shadow index (blue-green) pattern is the production-validated approach for zero-downtime embedding model migration. A practitioner walkthrough of this migration documents how this works end-to-end.
The sequence

The feature flag is the key mechanism — the actual database swap is secondary. In-flight queries complete against V1; new queries go to V2 the moment the flag flips.
Config that enables this
A few environment variables are the key enabler:
```python
import os

EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-small")
EMBED_DIMENSIONS = int(os.getenv("EMBED_DIMENSIONS", "1536"))
USE_V2_EMBEDDINGS = os.getenv("USE_V2_EMBEDDINGS", "false").lower() == "true"

def get_query_index():
    """Route queries to the appropriate index based on feature flag."""
    return "index_v2" if USE_V2_EMBEDDINGS else "index_v1"
```
With model config in environment variables, rolling back a bad migration is a config change, not a code deployment. That's the difference between a 10-minute rollback and a multi-hour one.
Build the shadow index
Use your standard batch re-embedding pipeline, pointed at a new collection. Per-batch commits, rate limiting, the works. The only difference is the destination: write to index_v2 while index_v1 stays live.
This doubles your storage costs during the migration window. For large indexes, plan for 24–72 hours of parallel operation. Budget accordingly.
Validate before swapping
Run a fixed set of golden queries — known inputs with known expected results — against both indexes. Compare the result set overlap. One documented migration reported 82% result overlap as a positive signal. If overlap drops below ~50%, investigate before swapping — different models produce different geometries, but the most relevant results should be broadly consistent.
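The overlap itself is trivial to compute; one common form is intersection-over-k on the top-k result IDs — the metric choice here is mine, since the cited migration doesn't specify its exact formula:

```python
def overlap_at_k(results_v1: list[str], results_v2: list[str], k: int = 10) -> float:
    """Fraction of top-k result IDs shared between two indexes for one query."""
    a, b = set(results_v1[:k]), set(results_v2[:k])
    return len(a & b) / k

# One golden query, top-5 doc IDs from each index:
v1 = ["d1", "d2", "d3", "d4", "d5"]
v2 = ["d2", "d1", "d6", "d4", "d7"]
overlap = overlap_at_k(v1, v2, k=5)  # → 0.6
```

In practice you'd average this over the whole golden-query set and alert on the distribution, not a single query — one divergent query is noise, a shifted median is a signal.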
⚠️ Warning: Result overlap is a sanity check, not a quality guarantee. A high overlap with a low-quality old model just means the new model is also retrieving low-quality results consistently. Where possible, test against labeled ground truth, not just overlap with the current index.
The swap
Most vector databases support atomic collection swaps via an alias mechanism (Qdrant, Pinecone, and Meilisearch all have this). The app points at an alias (production), and you update the alias to point at the new collection — when supported, this is an atomic operation from the database's perspective, so in-flight queries complete against v1 while new queries go to v2.
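Vendor APIs differ, so here's the pattern as a toy in-process stand-in rather than any real client call — the class and method names are invented for illustration:

```python
class AliasStore:
    """Toy stand-in for a vector DB's collection-alias mechanism.

    Real APIs differ (Qdrant, Pinecone, Meilisearch each have their own
    alias operations); this only illustrates the indirection.
    """

    def __init__(self):
        self._aliases: dict[str, str] = {}

    def set_alias(self, alias: str, collection: str) -> None:
        # In a real DB this assignment is the atomic swap operation.
        self._aliases[alias] = collection

    def resolve(self, alias: str) -> str:
        return self._aliases[alias]

store = AliasStore()
store.set_alias("production", "index_v1")

# The app only ever queries through the alias...
target = store.resolve("production")  # "index_v1"

# ...so the migration is one alias update, with no app deploy:
store.set_alias("production", "index_v2")
```

The indirection is what makes the swap safe: application code never holds a direct reference to `index_v1` or `index_v2`, so flipping the alias (or flipping it back during a rollback) touches exactly one place.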
After the swap, leave both collections running for a stability window (typically 24–48 hours) before dropping the old one. If retrieval quality degrades after the swap, you want the old collection available to roll back to.
Freshness vs. throughput: making the tradeoff explicit
Most teams don't choose their indexing mode — they inherit it from whatever was convenient to implement, then discover the tradeoffs when they scale.
| Mode | Staleness | API cost | Failure blast radius | Best for |
|---|---|---|---|---|
| Real-time (per-document) | ~0 | High (one API call per doc) | Small (one doc at a time) | Customer-facing, high-change docs |
| Micro-batch (every N minutes) | Minutes | Medium | Medium | Most production systems |
| Batch (scheduled, off-peak) | Hours | Low (max batching efficiency) | Large (one run, whole corpus) | Internal tools, infrequently updated content |
The question to answer explicitly at design time: what is the user-visible impact of a 1-hour stale index? For a customer support knowledge base, a 1-hour-old answer to a pricing question could be wrong. For internal engineering docs, it probably doesn't matter.
If you don't answer this question, you'll answer it implicitly when a real-time indexing pipeline hits API rate limits at scale and you discover the pipeline was designed for low volume.
Cost implications worth noting
Content hash storage adds infrastructure (Redis or a metadata table), but the API cost savings on re-runs can significantly reduce embedding costs as corpus size and change frequency grow.
Blue-green migration doubles storage for the transition window — for a large index, this can be significant.
Orphan accumulation has a cost too: larger indexes mean higher query latency and infrastructure cost. As your corpus grows, orphan cleanup becomes a first-class maintenance task rather than a cleanup afterthought.
The mental model shift
The production insight I keep coming back to: stop thinking about the embedding pipeline as a job that runs once at setup, and start thinking about it as a data migration system that runs continuously.
Data migration systems need idempotency (safe to re-run), version awareness (what model produced this vector), migration tooling (shadow index, feature flags), and garbage collection (orphan cleanup). These aren't RAG-specific concerns — they're the same engineering discipline you'd bring to any system where data evolves and the schema can change.
Three things I'd build in from the start:
Content hash check — makes re-runs safe and cheap
Shadow index capability — makes model migrations survivable
Explicit freshness SLA — forces the throughput tradeoff to be a decision, not an accident
What I'd watch in production: API cost per pipeline run (a sudden increase usually means the hash check stopped working), index size growth over time (signals orphan accumulation), and result quality metrics after any schema change.
The next post in this series covers chunking strategy and how to measure whether it's working. Chunking and embedding are closely coupled — changing your chunking strategy triggers a full re-index, which is exactly why the migration infrastructure described here needs to be in place before you start experimenting with chunk sizes.
Further Reading
LiveVectorLake: A Real-Time Versioned Knowledge Base Architecture — content-addressable hashing and dual-tier storage patterns
RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines — operational framework for production RAG
Zero-Downtime Embedding Migration — practitioner walkthrough of the shadow index pattern