Reranker
A reranker is a second-stage refinement step that re-scores the candidate documents returned by first-stage retrieval against the query and reorders them by relevance. It typically relies on a cross-encoder, which takes the query and document together as a single input to produce a precise relevance score.
- A reranker is a second stage that re-evaluates the top candidates from first-stage retrieval (embeddings, BM25, and the like) against the query and reorders them by relevance.
- Its core technology is the cross-encoder, which feeds the query and document through a transformer as a single combined input to compute a precise relevance score.
- A bi-encoder (embeddings) is fast but compresses each document into a single vector and loses information, whereas a cross-encoder is slower but far more accurate.
- Because a cross-encoder must run a transformer inference for every candidate, applying it directly to an entire corpus is prohibitively expensive and slow.
- That is why the standard pattern is two-stage retrieval: cast a wide net with fast search to gather candidates (recall), then narrow and precisely rank them with a reranker (precision).
What Is a Reranker?
In a search or RAG (retrieval-augmented generation) pipeline, a reranker is a second-stage refinement step that re-evaluates the candidate documents returned by first-stage retrieval against the query and reorders them from most to least relevant. First-stage retrieval usually pulls back hundreds of candidates quickly, using either embedding-based vector search or keyword search such as BM25. The reranker takes only the top slice of those candidates, carefully re-scores how relevant each one actually is to the query, and re-ranks them by that score.
The technology at the heart of a reranker is the cross-encoder. As Pinecone describes it, a reranker "outputs a similarity score given a query and document pair," and unlike an embedding model that processes the query and document separately, it analyzes the two together. Elastic's documentation likewise explains that a cross-encoder "takes the query and document text as a single concatenated input" to produce a query-aware representation of the document.
Bi-encoder (Embeddings) vs. Cross-encoder (Reranker)
To understand a reranker, you need to grasp how it differs from the bi-encoder used in first-stage retrieval. The fundamental distinction between the two approaches comes down to one question: are the query and document encoded separately, or encoded together?
| Aspect | Bi-encoder (embeddings) | Cross-encoder (reranker) |
|---|---|---|
| Input method | Encodes the query and document separately | Merges the query and document into one input and encodes them together |
| Output | Reusable embedding vectors | A relevance score (0–1) for the query–document pair |
| Query awareness | Document embeddings are query-agnostic (precomputed) | Evaluates the document in the context of the query |
| Speed | Fast (only cosine-similarity comparisons) | Slow (a transformer inference per candidate) |
| Accuracy | Relatively lower | High |
| Scalability | Can pre-index up to billions of documents | Applied in real time to only a small set of top candidates |
| Role | First-stage retrieval (broad, recall-focused) | Second-stage reranking (narrow, precision-focused) |
The bi-encoder's limitation is information loss. Pinecone points out that a bi-encoder "must compress all of the possible meanings of a document into a single vector," and so it loses information. On top of that, the query is only known at runtime, so document embeddings built in advance cannot reflect the query's context. A cross-encoder, by contrast, "runs the original information through a large transformer computation" with little information loss, analyzing the document in a way that is "specific to the user's query."
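The difference in how the two approaches consume their inputs can be illustrated with toy stand-ins. In the minimal sketch below, `embed`, `cosine`, and `cross_score` are hypothetical helpers written for illustration, not real models: `embed` turns one text into a bag-of-words vector without ever seeing the query, while `cross_score` receives the query and the document together and returns a single score for that pair.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'bi-encoder': encodes one text on its own, so the result is query-agnostic."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cross_score(query: str, doc: str) -> float:
    """Toy 'cross-encoder': sees query and document together and returns one score in [0, 1]."""
    q_terms = query.lower().split()
    doc_terms = set(doc.lower().split())
    if not q_terms:
        return 0.0
    # Fraction of query terms found in the document (a crude stand-in for joint attention).
    return sum(t in doc_terms for t in q_terms) / len(q_terms)

docs = ["rerankers score query document pairs", "bm25 is a keyword search method"]
doc_vectors = [embed(d) for d in docs]  # computed once, before any query exists

query = "how do rerankers score pairs"
bi_scores = [cosine(embed(query), v) for v in doc_vectors]   # reuses stored vectors
cross_scores = [cross_score(query, d) for d in docs]         # one joint pass per pair
```

The stored `doc_vectors` can be reused for any later query, whereas `cross_score` has nothing to store: it only produces a number once a specific query is supplied.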
Why a Cross-encoder Does Not Produce Embeddings
The Sentence Transformers (SBERT) documentation states plainly that "a cross-encoder does not produce a sentence embedding," and that you also cannot feed individual sentences into a cross-encoder on their own. That is because it passes two sentences through the transformer simultaneously and outputs only a similarity value between 0 and 1. In other words, a cross-encoder is highly accurate at answering "how relevant is this query to this document?" but cannot be used to produce reusable vectors for building a large-scale search index.
Why Split Into Two Stages: Balancing Speed and Precision
If a cross-encoder is more accurate, it might seem logical to apply it to every document from the start, but cost and latency make that impractical. Pinecone sums it up: "rerankers are slow, and retrievers are fast," and backs it with concrete numbers. Reranking 40 million records with a small BERT model on a V100 GPU would take more than 50 hours for a single query, whereas vector search finishes in under 100 milliseconds.
The SBERT documentation offers an example in the same vein. Clustering 10,000 sentences with a cross-encoder would require computing similarity for roughly 50 million combinations, taking about 65 hours, whereas obtaining each sentence's embedding with a bi-encoder takes just 5 seconds.
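The figures quoted above can be checked with a little arithmetic. The sketch below derives the number of sentence pairs and the throughput each quoted duration implies; the per-second rates are back-of-the-envelope values computed from the quoted numbers, not measurements from either source.

```python
import math

# Pairwise comparisons needed to cluster 10,000 sentences with a cross-encoder.
n_sentences = 10_000
pairs = math.comb(n_sentences, 2)  # n * (n - 1) / 2

# Throughput implied by the quoted 65 hours (derived, not measured).
pairs_per_second = pairs / (65 * 3600)

# Throughput implied by 40 million records in 50 hours. The source says
# "more than 50 hours", so this is an upper bound on the implied rate.
docs_per_second = 40_000_000 / (50 * 3600)

print(pairs)                    # 49995000
print(round(pairs_per_second))  # 214
print(round(docs_per_second))   # 222
```

Both examples imply a rate on the order of a couple of hundred query-document pairs per second, which is workable for a short candidate list but not for millions of documents per query.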
This is why the standard pattern is two-stage retrieval, which chains the two approaches together. The flow SBERT recommends works like this: first, use an efficient bi-encoder to quickly retrieve the top 100 candidates for the query; then use a cross-encoder to score each (query, candidate) pair and rerank those 100. The bi-encoder casts a wide net to secure recall, and the cross-encoder boosts precision. Elastic makes the same point, noting that semantic reranking "uses a relatively large and complex machine learning model in real time," so it makes sense to apply it as the final pipeline step to a small top-k result set.
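The retrieve-then-rerank flow described above can be sketched as a small function. This is a minimal illustration under stated assumptions: `two_stage_search`, `keyword_overlap`, and `phrase_aware` are hypothetical names, the two scorers are toy stand-ins for a real first-stage retriever and a real cross-encoder, and a production first stage would query a prebuilt index rather than scan the corpus.

```python
from typing import Callable, List, Tuple

def two_stage_search(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    candidates_k: int = 100,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Gather candidates with a cheap scorer, then rerank only those candidates."""
    # Stage 1 (recall): cheap score, keep the top candidates_k documents.
    candidates = sorted(
        corpus, key=lambda d: first_stage_score(query, d), reverse=True
    )[:candidates_k]
    # Stage 2 (precision): expensive score, computed for the candidates only.
    reranked = sorted(
        ((d, rerank_score(query, d)) for d in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:final_k]

def keyword_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared terms, ignoring word order."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_aware(query: str, doc: str) -> float:
    """Toy reranker: same overlap, plus a bonus when the query appears as a phrase."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return keyword_overlap(query, doc) + bonus

corpus = [
    "the model card lists the reward for bug reports",
    "training a reward model is one step of rlhf",
    "bm25 is a keyword ranking function",
]
results = two_stage_search(
    "reward model", corpus, keyword_overlap, phrase_aware, candidates_k=2, final_k=1
)
```

In this toy run the first stage cannot tell the first two documents apart because both share two terms with the query. The second stage breaks the tie because it looks at the query and document text together.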
Real-World Impact and Examples
A reranker promotes relevant documents that first-stage retrieval returned but ranked too low. It cannot recover documents that never entered the candidate set, which is why first-stage recall still matters. In Pinecone's example, for a query about RLHF, the single most relevant piece of information sat at position 23 in the first-stage results, but after reranking it rose to position 1. As a result, "much more relevant information" was passed to the LLM and noise was reduced. Pinecone calls reranking "one of the simplest methods for dramatically improving recall performance in RAG or retrieval-based pipelines," typically delivering far more relevant results in exchange for "a few lines of code and a small amount of latency."
Commercial rerankers are also widely used. Cohere's Rerank model "sorts text inputs by their semantic relevance to a query" and is frequently used to re-sort results returned by an existing search solution. For each request, Cohere Rerank combines the query tokens with the tokens of each document, and models such as rerank-v3.5 have a context limit of 4,096 tokens for that combined input. When the combined query-plus-document token count exceeds the limit, the document is automatically split into chunks and processed across multiple inferences.
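The chunking behavior described above can be approximated as follows. This is a sketch under stated assumptions, not Cohere's implementation: whitespace-separated words stand in for model tokens, `chunk_for_reranker`, `score_long_document`, and `toy_overlap` are hypothetical helpers, and taking the highest chunk score as the document's score is an assumption the text above does not confirm.

```python
from typing import Callable, List

def chunk_for_reranker(
    query_tokens: List[str], doc_tokens: List[str], context_limit: int = 4096
) -> List[List[str]]:
    """Split a document so that query + chunk fits within context_limit tokens."""
    budget = context_limit - len(query_tokens)
    if budget <= 0:
        raise ValueError("query alone fills the context limit")
    return [doc_tokens[i:i + budget] for i in range(0, len(doc_tokens), budget)]

def score_long_document(
    query: str,
    doc: str,
    score_pair: Callable[[str, str], float],
    context_limit: int = 4096,
) -> float:
    """Score each chunk against the query and keep the best chunk score."""
    chunks = chunk_for_reranker(query.split(), doc.split(), context_limit)
    # Assumption: the best-matching chunk decides the document's score.
    return max((score_pair(query, " ".join(c)) for c in chunks), default=0.0)

def toy_overlap(query: str, text: str) -> float:
    """Toy pair scorer: fraction of query terms present in the text."""
    q_terms = set(query.split())
    return len(q_terms & set(text.split())) / len(q_terms)

# A 100-token query leaves 3,996 tokens per chunk for a 10,000-token document.
chunk_sizes = [len(c) for c in chunk_for_reranker(["q"] * 100, ["d"] * 10_000)]
print(chunk_sizes)  # [3996, 3996, 2008]

best = score_long_document("alpha beta", "x y z alpha beta q", toy_overlap, context_limit=5)
print(best)  # 1.0
```

Each chunk costs one extra inference, so a document far beyond the limit multiplies reranking cost for that candidate.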
Implementation Checklist
- Retrieve a generous set of candidates with first-stage search (vector or BM25), for example the top 50–100, and then apply the reranker.
- Apply the reranker only to the top-k candidate set rather than the entire corpus, to keep latency and cost under control.
- For RAG, place only the top few results after reranking into the LLM context to cut noise and improve accuracy.
- For multilingual documents, choose a multilingual reranker (for example, Cohere's rerank-multilingual-v3.0 or rerank-v3.5).
- For long documents, design your chunking strategy with the model's per-document token limit (for example, 4,096) in mind.