
The Chunking Problem: Size, Overlap, and How to Measure What Works

How to think through chunking strategy and build the eval loop that tells you if it's working.

Everyone picks a chunk size. Almost nobody measures whether it's the right one.

You set it once — probably 512 tokens, because that's what the tutorial used — and it never changed. Sometimes retrieval feels off: the right answer is in the document, but the wrong passage comes back. It's hard to tell whether chunking is the problem or something downstream, so the setting stays.

This post covers two things: how to reason about chunking decisions (strategy, size, overlap), and how to build the eval loop that tells you whether those decisions are working. The eval part doesn't have to be expensive.

ℹ️ Note: This post is part of the Production RAG series. Post 3 covered the embedding pipeline and re-indexing — your pipeline design constrains how easy it is to change chunking strategy, so that context matters here.

The three decisions you're making (and why they're coupled)

Chunking involves three decisions that are usually treated independently but interact with each other:

1. Chunking strategy — how you determine chunk boundaries. Fixed-size splits at token count. Sentence-based splits at punctuation. Semantic chunking detects topic shifts between sentences.

2. Chunk size — how large each chunk is, measured in tokens (not characters). Small chunks are precise but lose context. Large chunks provide context but add noise to the retrieval signal.

3. Overlap — how much content repeats between adjacent chunks. Intended to prevent concepts from being split across boundaries.

These aren't independent: chunk size interacts with your embedding model's context window and architecture. A size that works well with one model may perform worse with another. And changing chunking strategy later means a full re-index — the embedding pipeline cost from Post 3 applies here.

The practical implication: set these decisions deliberately, measure them, and treat a chunking change as a migration, not a configuration tweak.
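One way to make the migration framing operational: version the chunking parameters alongside the index at build time, so a drifted config fails loudly instead of silently mixing chunks from different generations. A sketch under assumed names (nothing here is a real API; the tokenizer name is illustrative):

```python
# Chunking parameters recorded when the index was built.
# The tokenizer name is a placeholder for whatever model you embed with.
CHUNKING_CONFIG = {
    "strategy": "recursive",
    "chunk_size_tokens": 256,
    "overlap_tokens": 25,
    "tokenizer": "thenlper/gte-small",
}

def needs_reindex(stored: dict, current: dict) -> bool:
    """Any change to chunking parameters invalidates every existing chunk."""
    return stored != current

# Bumping chunk size is a migration: the whole corpus must be re-chunked
# and re-embedded, not just newly added documents.
new_config = {**CHUNKING_CONFIG, "chunk_size_tokens": 512, "overlap_tokens": 51}
assert needs_reindex(CHUNKING_CONFIG, new_config)
```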

The three decisions aren't independent. Content type drives starting size; embedding model architecture modifies the range. Measure before committing.

Fixed vs. semantic chunking: what each optimizes for

The default advice is to use semantic chunking because it "understands" content. The research is more nuanced.

A 2024 study comparing fixed-size, breakpoint-based semantic, and clustering-based semantic chunking across 10 document retrieval datasets found that fixed-size chunking performed best on 6 of them. For evidence retrieval (finding the specific passage that supports an answer), differences were within 1-2 percentage points across all three methods. Answer quality showed negligible variation.

The conclusion: "computational costs associated with semantic chunking are not justified by consistent performance gains."

Rule of thumb:

  • Default to fixed-size with sentence-boundary awareness

  • Consider semantic chunking when your corpus is assembled from multiple distinct sources, topics shift hard within documents, or you can measure that it helps

The practical implementation for most teams is RecursiveCharacterTextSplitter, which handles boundaries via a separator hierarchy: it tries to split on paragraph breaks (\n\n), then sentence breaks (\n), then punctuation (.), then spaces, in that order. The hierarchy does most of the boundary-handling work that overlap is often credited for.
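To make the hierarchy concrete, here is a deliberately simplified sketch of the fallback-and-merge logic. It measures length in characters, drops separators at chunk boundaries, and ignores overlap; the real splitter handles all three properly:

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple = ("\n\n", "\n", ". ", " ")) -> list:
    """Simplified separator-hierarchy splitting.

    Try the coarsest separator first; fall back to finer ones only for
    pieces that are still too large, then greedily merge neighbors back
    up to max_len so we never end up with word-sized chunks.
    """
    if len(text) <= max_len:
        return [text] if text.strip() else []
    # Coarsest separator actually present in this text.
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        # Nothing left to split on: hard-cut at max_len.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    finer = separators[separators.index(sep) + 1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_len, finer))
    # Greedy merge: rejoin adjacent pieces while the result still fits.
    merged, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            merged.append(current)
            current = piece
    merged.append(current)
    return merged

doc = ("First paragraph about chunking.\n\n"
       "Second paragraph. It is quite a bit longer and has two sentences.")
chunks = recursive_split(doc, max_len=40)
```

On the sample document this keeps the first paragraph whole and only falls back to sentence and word splits for the oversized second paragraph.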

⚠️ Measure chunk size in tokens, not characters. Your embedding model's context window is defined in tokens. A character-based chunk size gives misleading results — 512 characters is roughly 128 tokens for typical English text, which may be far smaller than intended.

ℹ️ Note: In recent LangChain versions (0.2+), text splitters live in the langchain-text-splitters package. If you get an import error, install it separately: pip install langchain-text-splitters.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

def split_documents(chunk_size: int, knowledge_base, tokenizer_name: str):
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )
    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicate chunks (can occur when separator hierarchy creates identical short segments)
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)
    return docs_processed_unique

Chunk size: match it to your content type

The optimal chunk size depends on the locality of the answer in your content — how concentrated the relevant information is.

  • Fact-dense / concise QA (e.g., FAQ docs, technical reference): 64–128 tokens

  • Technical / conceptual (e.g., architecture docs, tutorials): 512–1024 tokens

  • Narrative / long-form (e.g., reports, case studies): 512–1024 tokens

The data: on SQuAD (factual QA), 64-token chunks yield 64.1% Recall@1. Moving to 512 tokens drops recall by 10–15% because the additional content introduces noise into the retrieval signal. On TechQA (technical content), 512–1024-token chunks reach up to 71.5% Recall@1, and smaller chunks underperform because the answers require surrounding context to be meaningful.

Chunk size also interacts with embedding model architecture. Decoder-based models with large context windows (like Stella at 130k+ tokens) benefit from larger chunks. Encoder-based models with smaller windows (like Snowflake at ~8k tokens) tend to perform better with smaller chunks. The same setting applied to both models produces different outcomes — which matters when you change models mid-deployment.

Starting point for most content: 256 tokens. Not because 256 is magic, but because it sits between the two extremes and gives you a baseline to measure from in both directions.

Overlap and the boundary problem

The rationale for overlap is sound: if a concept spans a chunk boundary, you risk retrieving half the concept without the context needed to understand it. Overlap repeats content across boundaries to hedge against this.

The practical evidence is less clear. The 2024 semantic chunking study found no measurable benefit from overlap for most content — and notes it increases indexing cost. When the separator hierarchy in your text splitter is working correctly — splitting on paragraphs, then sentences, then words — the boundary problem is already largely handled before overlap applies.

Overlap's value is specifically when documents have no natural boundaries: dense structured data, transcripts without punctuation, or content where sentences routinely span multiple concepts. For well-structured prose, the separator hierarchy usually handles it.

The practical setting: 10% of chunk size (chunk_overlap=int(chunk_size / 10)). This is the HuggingFace cookbook default and a reasonable starting point. Measure whether removing it affects your retrieval metrics before treating it as essential.
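Mechanically, overlap is just a sliding window whose step is chunk_size minus overlap. A toy sketch with fake tokens standing in for real tokenizer output:

```python
def chunk_with_overlap(tokens: list, chunk_size: int, overlap: int) -> list:
    """Fixed-size windows; the last `overlap` tokens of each chunk
    are repeated at the start of the next one."""
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks

tokens = [f"t{i}" for i in range(20)]
chunks = chunk_with_overlap(tokens, chunk_size=8, overlap=2)

# Tokens t6 and t7 sit at the first boundary and appear in both chunks,
# so a concept spanning that boundary survives intact in at least one chunk.
assert "t6" in chunks[0] and "t6" in chunks[1]
```

The cost side is equally mechanical: every overlapping token is embedded and stored twice, which is where the increased indexing cost comes from.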

Building the eval loop

The loop has three steps: generate a ground-truth dataset, run the retrieval eval, change one variable and compare. For a modest corpus (a few thousand documents), the first pass takes a few hours; scale time accordingly for larger indexes.

Step 1: Generate a synthetic ground-truth dataset

You need a set of (question, source chunk) pairs: for each question, you know which chunk should come back when you retrieve against it. The LLM-as-judge approach generates these without manual labeling.

QA_GENERATION_PROMPT = """
Your task is to write a factoid question and an answer given a context.
The factoid question should be answerable with a specific, concise piece of factual
information from the context.
The question should be formulated as a user would ask it — not referencing "the passage"
or "the context".

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Context: {context}
Output:::"""

Sample ~200 chunks from your index, generate a QA pair for each, then filter for quality. Score each pair on three dimensions (1–5): groundedness (answer is in the chunk), relevance (question is meaningful standalone), and quality (question is well-formed). Keep pairs where all three scores are ≥ 4. Expect to filter out roughly half — you need ~100 usable pairs.
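The filter step itself is mechanical once the judge scores exist. A minimal sketch, assuming an LLM judge has already returned the three scores as a dict (the dict shape is an assumption, not a fixed API):

```python
def passes_quality_filter(scores: dict, threshold: int = 4) -> bool:
    """Keep a QA pair only if groundedness, relevance, and quality
    all clear the bar. `scores` comes from an LLM-as-judge call."""
    return all(scores[dim] >= threshold
               for dim in ("groundedness", "relevance", "quality"))

# Example judge outputs (in practice these come from the judge LLM):
assert passes_quality_filter({"groundedness": 5, "relevance": 4, "quality": 4})
assert not passes_quality_filter({"groundedness": 5, "relevance": 3, "quality": 5})
```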

The critical detail: record the source chunk's ID when you sample, so you can verify retrieval. Most vector stores return a document or chunk ID in the metadata — capture it:

import random

eval_pairs = []
for chunk in random.sample(indexed_chunks, 200):
    qa = generate_qa_pair(chunk.page_content)  # calls LLM with QA_GENERATION_PROMPT
    if qa and passes_quality_filter(qa):
        eval_pairs.append({
            "question": qa["question"],
            "answer": qa["answer"],
            "source_chunk_id": chunk.metadata["id"],  # from your vector store
        })

💡 On synthetic dataset limitations: LLM-generated questions tend to use vocabulary from the source chunk, which makes retrieval look easier than it is for real user queries. Recent research confirms synthetic benchmarks underestimate task difficulty. Use synthetic datasets to detect regressions between chunking configurations — not to benchmark absolute retrieval quality.

Step 2: Run the retrieval eval

With a ground-truth dataset, compute two metrics:

Hit@k (Hit Rate): For each question, does the correct chunk appear in the top-k results? Binary, easy to compute.

MRR@k (Mean Reciprocal Rank): For each question, what is the reciprocal rank of the first correct result? Rewards ranking quality — a system that returns the right chunk at position 1 scores higher than one that returns it at position 5.
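A worked example of both metrics over three questions, with the correct chunk at rank 1, rank 3, and missing from the top 5 respectively:

```python
k = 5
ranks = [1, 3, None]  # rank of the correct chunk per question; None = not retrieved

hit_rate = sum(1 for r in ranks if r is not None and r <= k) / len(ranks)
mrr = sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)

print(round(hit_rate, 3))  # 0.667  (2 of 3 questions hit)
print(round(mrr, 3))       # 0.444  ((1 + 1/3 + 0) / 3)
```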

def compute_retrieval_metrics(eval_pairs, retriever, k=5):
    """
    eval_pairs: list of {"question": str, "source_chunk_id": str} dicts
    retriever: has a .retrieve(query, top_k) method returning results with .chunk_id
    """
    if not eval_pairs:
        return {"hit_rate_at_k": 0, "mrr": 0, "k": k}

    hits = 0
    reciprocal_ranks = []

    for pair in eval_pairs:
        results = retriever.retrieve(pair["question"], top_k=k)
        result_ids = [r.chunk_id for r in results]

        # Hit@k and MRR share the same membership check.
        if pair["source_chunk_id"] in result_ids:
            hits += 1
            rank = result_ids.index(pair["source_chunk_id"]) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)

    hit_rate = hits / len(eval_pairs)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return {"hit_rate_at_k": hit_rate, "mrr": mrr, "k": k}

Run this before changing anything. Record the baseline.

Step 3: Change one variable, re-run

Change chunk size — say, from 256 to 128 or from 256 to 512. Re-embed the corpus with the new chunk size. Re-run the eval. Compare Hit@5 and MRR.

If the metrics don't move: chunking probably isn't your problem. Something else in the pipeline is the bottleneck.

If one direction improves: you have signal. Validate with a second round to confirm it's not noise, then make the change production.
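The comparison step can be a plain table of metrics per run. The numbers below are placeholders purely to show the shape of the decision, not real results:

```python
# Placeholder metrics from three hypothetical eval runs (not real data).
results = {
    128: {"hit_rate_at_k": 0.71, "mrr": 0.58},
    256: {"hit_rate_at_k": 0.78, "mrr": 0.64},
    512: {"hit_rate_at_k": 0.74, "mrr": 0.61},
}

baseline = results[256]
best_size = max(results, key=lambda s: results[s]["hit_rate_at_k"])

# With ~100 eval pairs, a couple of percentage points is a handful of
# questions; demand a margin bigger than noise before migrating.
margin = results[best_size]["hit_rate_at_k"] - baseline["hit_rate_at_k"]
meaningful = margin > 0.05
```

In this fabricated run the baseline survives; a real decision would also weigh MRR and the second validation round before committing to a re-index.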

The eval loop. Takes a few hours to set up the first time; subsequent runs are fast.

How to tell if chunking is your problem

Retrieval failure and generation failure look similar from the outside — both produce wrong answers. The distinction matters because the fixes are completely different.

Diagnostic: Log the retrieved chunks for a set of known-bad outputs. If the wrong chunk is returning for a query, the problem is in retrieval (chunking, embedding, or search configuration). If the right chunk is returning but the answer is wrong, the problem is in generation.
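The diagnostic can be a small triage pass over logged failures (the field names here are illustrative, not a fixed schema):

```python
def triage(failure: dict) -> str:
    """Classify a logged bad answer as a retrieval or generation failure.

    Expects per-failure dicts like:
    {"retrieved_ids": [...], "gold_chunk_id": "..."}.
    """
    if failure["gold_chunk_id"] not in failure["retrieved_ids"]:
        return "retrieval"   # right chunk never came back: chunking/embedding/search
    return "generation"      # right chunk was there; the model still answered wrong

failures = [
    {"retrieved_ids": ["c4", "c9"], "gold_chunk_id": "c1"},
    {"retrieved_ids": ["c1", "c2"], "gold_chunk_id": "c1"},
]
print([triage(f) for f in failures])  # ['retrieval', 'generation']
```

Only the first class of failure is worth fixing with chunking changes; the second points at the prompt or the generation model.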

Chunk-specific failure signatures:

  • Answer is almost right but missing key context → chunk size too small; answer spans a boundary

  • Retriever returns vaguely related passages, not the exact content → chunk size too large; noise diluting the retrieval signal

  • Answers are accurate when the retrieved chunk is correct, but recall is low → boundary handling issue; overlap or separator strategy

  • Performance inconsistent across document types → uniform chunk size applied to heterogeneous content

The eval loop catches most of these systematically. If you don't have it yet, the log-the-retrieved-chunks approach gives you enough signal to diagnose the direction.

What to do

Setup:

  • Start with RecursiveCharacterTextSplitter, tokenizer-based chunk measurement, 256 tokens, 10% overlap

  • This is a reasonable default for mixed-content corpora and gives you something measurable to start from
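The setup above, captured as a config fragment (the dict is just a record of the defaults; values mirror the earlier splitter code):

```python
STARTING_CONFIG = {
    "splitter": "RecursiveCharacterTextSplitter",
    "measure_in": "tokens",         # via from_huggingface_tokenizer, not characters
    "chunk_size": 256,
    "chunk_overlap": 256 // 10,     # 10% of chunk size = 25 tokens
    "separators": ["\n\n", "\n", ".", " ", ""],
}
```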

Measure before tuning:

  • Build the 100-sample eval dataset (a few hours for modest corpora)

  • Compute baseline Hit@5 and MRR@10

  • You now have something to measure against

Tune:

  • Test one size smaller and one larger (e.g., 128, 256, 512)

  • Hold everything else constant — same embedding model, same separator configuration

  • Compare metrics; pick the best, then validate once more

On semantic chunking:

  • Only consider it if your content has hard topic shifts and fixed-size consistently underperforms by more than noise

  • The computational cost is real; the performance gain usually isn't

On overlap:

  • Keep the 10% default but measure it

  • If your content has natural paragraph boundaries, overlap is unlikely to be load-bearing

  • Remove it in one experiment; if metrics hold, the cost saving is free
