
Chunking

Chunking is the process of splitting long documents into smaller units (chunks) suited to retrieval and embedding in a RAG pipeline. Because the split method, chunk size, and overlap directly shape retrieval accuracy and answer quality, it is treated as a core step in RAG pipeline design.

  • Chunking splits documents into retrieval- and embedding-sized units in RAG, and since the chunk is the smallest unit of retrieval, the quality of the split largely determines the quality of the answer.
  • The common strategies are fixed-size, recursive, document-structure-based, semantic, and contextual (including late chunking), and recursive splitting is the balanced default in most cases.
  • Pinecone suggests 512-token chunks with 50-100 tokens of overlap as a starting point, and the industry rule of thumb is 10-20% overlap.
  • Chunk size should be chosen from the embedding model's input length, the content type, and the expected query shape, then tuned by benchmarking on your actual corpus.
  • Anthropic's Contextual Retrieval prepends explanatory context to each chunk before embedding; combined with contextual BM25, it cut the top-20 chunk retrieval failure rate by 49% (67% with reranking added).

What chunking is

Chunking is the work of dividing long source text into small pieces, or chunks, that are suited to embedding and vector search within a RAG (retrieval-augmented generation) pipeline. The retrieval step finds the chunks that are semantically closest to the query and passes them to the LLM as context, so the chunk becomes the smallest unit of retrieval and citation. How you cut the text therefore directly governs both retrieval accuracy and the quality of the final answer.

When a chunk is too large, several topics blur together inside it, the embedding loses focus, and irrelevant information enters the context and dilutes the answer. When a chunk is too small, sentences and arguments are cut off mid-thought, context disappears, and the information you actually need is scattered and missed during retrieval. The goal of chunking is to find, between these two extremes, a unit in which meaning is preserved intact.

Comparing chunking strategies

The chunking strategies most commonly used in practice are listed below. Semantic chunking is one branch among them; chunking is the umbrella concept that spans all of these strategies.

| Strategy | Split basis | Strengths | Limitations |
| --- | --- | --- | --- |
| Fixed-size | Cuts uniformly every N tokens (or characters), with optional overlap | Simplest to implement and produces uniform chunk sizes | Ignores document structure, severing sentences and arguments mid-stream |
| Recursive | Applies separators in order (paragraph → line break → space → character), splitting down to the target size | Largely preserves document structure with no external model; a general-purpose default | Relies on separators, so its effect is limited on documents with weak structure |
| Document-based | Uses the structure the document already provides, such as headings, sections, tables, and code blocks | Well suited to clearly structured material like legal, technical, and API documentation | Hard to apply to plain text whose structure is inconsistent |
| Semantic | Uses sentence-embedding similarity to locate the boundaries where the topic shifts, then splits there | Forms chunks along natural meaning boundaries | Chunk sizes vary widely, and the embedding computation adds cost |
| Contextual / late | Prepends document context to a chunk before embedding, or embeds the whole document first and splits afterward | Preserves surrounding context to lift retrieval accuracy | Requires LLM calls or a long-context model, raising cost and complexity |
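
To make the semantic row concrete, here is a minimal sketch of splitting where adjacent-sentence similarity drops. It assumes a toy bag-of-words vector as a stand-in for a real sentence-embedding model, and the names embed, cosine, semantic_chunks, and the 0.2 threshold are all illustrative choices, not part of any library.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])      # topic shift: open a new chunk
        else:
            groups[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Vector search ranks chunks by embedding similarity.",
    "Embedding similarity is computed between the query and each chunk.",
    "Our office cafeteria serves lunch at noon.",
    "The cafeteria lunch menu changes weekly.",
]
result = semantic_chunks(sentences)
for chunk in result:
    print(chunk)
```

With a real embedding model the threshold would need tuning on your corpus; production implementations often derive it from a percentile of the observed similarity drops rather than a fixed value.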

Recommended parameters and the evidence

Pinecone's chunking guide offers 512-token chunks with 50-100 tokens of overlap as a starting point, and recommends trying smaller chunks of 128-256 tokens when the content is short and fact-dense, and larger chunks of 512-1024 tokens when context matters. Overlap is the span of text that adjacent chunks share; it reduces the loss of context at chunk boundaries, and the industry rule of thumb is to start at 10-20% of the chunk size (roughly 50-100 tokens for a 512-token chunk).
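
The relationship between chunk size and overlap can be sketched as a sliding window whose stride is chunk_size minus overlap. This is an illustrative sketch only: the function name fixed_size_chunks is hypothetical, and the pre-split token list stands in for the output of a real tokenizer.

```python
def fixed_size_chunks(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Slide a window of chunk_size tokens, advancing by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Placeholder tokens; in practice these come from your embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, chunk_size=512, overlap=64)  # 64 / 512 = 12.5% overlap

print([len(c) for c in chunks])
print(chunks[0][-64:] == chunks[1][:64])  # adjacent chunks share the overlap region
```

Larger overlap lowers the stride, so the same document yields more chunks and a larger index; that storage and embedding cost is the trade-off against better boundary coverage.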

LangChain's RecursiveCharacterTextSplitter, the standard implementation of recursive splitting, applies the separator list ["\n\n", "\n", " ", ""] in priority order: it preserves paragraphs first, and if a piece is still too large it cuts by line break, then space, then character, retaining as much document structure as possible. The example parameters from the official docs are as follows.

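```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder input; substitute your own document text.
document = "First paragraph of the source text.\n\nSecond paragraph of the source text."

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,      # maximum chunk size, in characters by default (length_function=len)
    chunk_overlap=20,    # maximum overlap between adjacent chunks, in the same unit
)
chunks = splitter.split_text(document)  # returns a list of strings
```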
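
The priority-ordered separator logic described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: it omits overlap, drops the separator wherever a new chunk begins, and the function name recursive_split is hypothetical.

```python
def recursive_split(
    text: str, chunk_size: int, separators: tuple[str, ...] = ("\n\n", "\n", " ", "")
) -> list[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left, or the empty separator: hard-cut by character as a last resort.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate           # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)        # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece               # piece starts the next chunk
        else:
            # The piece alone is too large: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph one.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "End."
)
pieces = recursive_split(text, chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

The short first paragraph survives intact, while only the oversized second paragraph is broken at word boundaries, which is the structure-preserving behavior that makes recursive splitting a reasonable default.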

There is no absolute right answer for chunk size; set it to match the input length of the embedding model you use, the content type, and the expected query shape, and validate it by benchmarking on your real corpus. Splitting by character count (the default length function in the LangChain splitter above) rather than by tokens can drift out of sync with the embedding model's token limit, so it is safer to align the length function used for splitting with the tokenizer of your embedding model.
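
A small sketch of why the length function matters: the same packing routine produces different boundaries depending on whether it counts characters or tokens. The names pack_by_length and whitespace_token_count are hypothetical, and the whitespace counter is only a stand-in for your embedding model's real tokenizer.

```python
from typing import Callable


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in tokenizer; swap in your embedding model's tokenizer."""
    return len(text.split())


def pack_by_length(
    pieces: list[str], max_len: int, length_fn: Callable[[str], int]
) -> list[str]:
    """Greedily merge pieces while length_fn(merged) stays within max_len.

    A single piece that exceeds max_len on its own is kept whole in this sketch.
    """
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and length_fn(candidate) > max_len:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


pieces = [
    "Chunking splits documents.",
    "Overlap softens boundaries.",
    "Benchmark on your corpus.",
]
by_chars = pack_by_length(pieces, max_len=60, length_fn=len)
by_tokens = pack_by_length(pieces, max_len=5, length_fn=whitespace_token_count)

print(len(by_chars), len(by_tokens))
```

LangChain's splitters accept a length_function argument (len by default), so the same idea applies there: pass a function that counts tokens with the tokenizer your embedding model actually uses.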

Approaches that compensate for a chunk's own lack of context are also an active area of research. Anthropic's Contextual Retrieval uses an LLM to generate, for each chunk, context explaining where that chunk sits within the document, prepends it, and then embeds the result. According to the announcement, this contextual embedding alone reduced the top-20 chunk retrieval failure rate by 35% (5.7% → 3.7%); combining it with contextual BM25 cut the rate by 49% (5.7% → 2.9%), and adding reranking brought the reduction to as much as 67% (5.7% → 1.9%). With prompt caching, the chunk context can be generated for roughly $1.02 per million document tokens, keeping it cost-efficient.

Separately, Late Chunking, proposed by Jina AI (arXiv:2409.04701, 2024), does not cut chunks up front. Instead it embeds the entire document with a long-context model and then pools the token embeddings that fall within each chunk's span into one vector per chunk, preserving the global context that ordinarily disappears during splitting.
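
Both ideas can be sketched in a few lines. This is an illustration under assumptions, not either vendor's implementation: generate_context is a hypothetical callable standing in for an LLM call, the token embeddings are made-up numbers standing in for a long-context model's output, and mean pooling is assumed as the pooling step.

```python
from typing import Callable


def contextualize(
    chunks: list[str], document: str, generate_context: Callable[[str, str], str]
) -> list[str]:
    """Prepend generated context to each chunk; the combined text is what gets embedded."""
    return [f"{generate_context(document, chunk)}\n\n{chunk}" for chunk in chunks]


def late_chunk_pool(
    token_embeddings: list[list[float]], spans: list[tuple[int, int]]
) -> list[list[float]]:
    """Mean-pool token embeddings over each chunk's [start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


def fake_context(document: str, chunk: str) -> str:
    """Hypothetical stand-in for an LLM that situates the chunk in the document."""
    return "From a glossary entry on chunking in RAG pipelines."


document = "Chunking splits documents. Overlap softens boundaries."
chunks = ["Chunking splits documents.", "Overlap softens boundaries."]
contextualized = contextualize(chunks, document, fake_context)

# Made-up 2-dimensional token embeddings for a 4-token document.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
pooled = late_chunk_pool(token_embeddings, spans=[(0, 2), (2, 4)])

print(contextualized[0])
print(pooled)
```

The contrast is in where the extra context comes from: contextual embeddings add it as text before the model runs, while late chunking relies on the model's attention over the full document, so each token vector already reflects its surroundings before pooling.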

Implementation checklist

  • Adopt recursive splitting (with separators matched to the document type) as your default, and start here unless you have a specific reason not to.
  • Begin with 512-token chunks plus 10-20% overlap (50-100 tokens), then adjust.
  • Have the splitter's length function count tokens, using the same tokenizer as your embedding model.
  • For documents with clear structure such as headings, tables, and code, consider document-structure-based splitting first.
  • Before spending the extra cost on semantic chunking, confirm with a benchmark on your own corpus that it actually improves on recursive splitting.
  • If a chunk's lack of context is the cause of retrieval failures, try introducing contextual embeddings or late chunking.
