Reranker

A reranker is a second-stage refinement step that re-scores the candidate documents returned by first-stage retrieval against the query and reorders them by relevance. It typically relies on a cross-encoder, which takes the query and document together as a single input to produce a precise relevance score.

  • A reranker is a second stage that re-evaluates the top candidates from first-stage retrieval (embeddings, BM25, and the like) against the query and reorders them by relevance.
  • Its core technology is the cross-encoder, which feeds the query and document through a transformer as a single combined input to compute a precise relevance score.
  • A bi-encoder (embeddings) is fast but compresses each document into a single vector and loses information, whereas a cross-encoder is slower but far more accurate.
  • Because a cross-encoder must run a transformer inference for every candidate, applying it directly to an entire corpus is prohibitively expensive and slow.
  • That is why the standard pattern is two-stage retrieval: cast a wide net with fast search to gather candidates (recall), then narrow and precisely rank them with a reranker (precision).

What Is a Reranker?

In a search or RAG (retrieval-augmented generation) pipeline, a reranker is a second-stage refinement step that re-evaluates the candidate documents returned by first-stage retrieval against the query and reorders them from most to least relevant. First-stage retrieval usually pulls back hundreds of candidates quickly, using either embedding-based vector search or keyword search such as BM25. The reranker takes only the top slice of those candidates, carefully re-scores how relevant each one actually is to the query, and re-ranks them by that score.

The technology at the heart of a reranker is the cross-encoder. As Pinecone describes it, a reranker "outputs a similarity score given a query and document pair," and unlike an embedding model that processes the query and document separately, it analyzes the two together. Elastic's documentation likewise explains that a cross-encoder "takes the query and document text as a single concatenated input" to produce a query-aware representation of the document.
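The difference in input shape can be made concrete with a small sketch. The `[CLS]` and `[SEP]` markers below follow the BERT-style convention for sentence pairs, which is an assumption for illustration; the function names are hypothetical, and a real model applies its own tokenizer and special tokens.

```python
def bi_encoder_inputs(query: str, document: str) -> tuple[str, str]:
    """Two independent inputs: each text is encoded without seeing the other."""
    return query, document


def cross_encoder_input(query: str, document: str) -> str:
    """One combined input, using BERT-style [CLS]/[SEP] markers (an assumption)."""
    return f"[CLS] {query} [SEP] {document} [SEP]"


query = "what is a reranker"
document = "A reranker re-scores retrieved candidates against the query."

# Separate inputs: the document's representation cannot depend on the query.
print(bi_encoder_inputs(query, document))

# Single concatenated input: the model can attend across query and document.
print(cross_encoder_input(query, document))
```

Because the cross-encoder sees both texts in one sequence, every document token can be interpreted in light of the query, which is what makes the resulting score query-aware.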

Bi-encoder (Embeddings) vs. Cross-encoder (Reranker)

To understand a reranker, you need to grasp how it differs from the bi-encoder used in first-stage retrieval. The fundamental distinction between the two approaches comes down to one question: are the query and document encoded separately, or encoded together?

| Aspect | Bi-encoder (embeddings) | Cross-encoder (reranker) |
|---|---|---|
| Input method | Encodes the query and document separately | Merges the query and document into one input and encodes them together |
| Output | Reusable embedding vectors | A relevance score (0–1) for the query–document pair |
| Query awareness | Document embeddings are query-agnostic (precomputed) | Evaluates the document in the context of the query |
| Speed | Fast (only cosine-similarity comparisons) | Slow (a transformer inference per candidate) |
| Accuracy | Relatively lower | High |
| Scalability | Can pre-index up to billions of documents | Applied in real time to only a small set of top candidates |
| Role | First-stage retrieval (broad, recall-focused) | Second-stage reranking (narrow, precision-focused) |

The bi-encoder's limitation is information loss. Pinecone points out that a bi-encoder "must compress all of the possible meanings of a document into a single vector," and so it loses information. On top of that, the query is only known at runtime, so document embeddings built in advance cannot reflect the query's context. A cross-encoder, by contrast, "runs the original information through a large transformer computation" with little information loss, analyzing the document in a way that is "specific to the user's query."
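A toy sketch of the bi-encoder pattern shows why document vectors are query-agnostic: they are built once at index time, and only the query is embedded at search time. The `embed` function here is a bag-of-words stand-in, not a real embedding model, and all names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = [
    "rerankers reorder candidates by relevance",
    "bm25 is a keyword search algorithm",
    "vector search compares embeddings",
]

# Index time: document vectors are built once, before any query exists.
index = [embed(doc) for doc in documents]


def search(query: str, top_k: int = 2) -> list[tuple[float, str]]:
    """Query time: embed only the query, then compare against stored vectors."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, vec), doc) for vec, doc in zip(index, documents)),
        reverse=True,
    )
    return scored[:top_k]


print(search("keyword search with bm25"))
```

The stored vectors never change when the query changes, which is what makes this stage fast and also why it cannot weigh a document's content against a specific question.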

Why a Cross-encoder Does Not Produce Embeddings

The Sentence Transformers (SBERT) documentation states plainly that "a cross-encoder does not produce a sentence embedding," and that you also cannot feed individual sentences into a cross-encoder on their own. That is because it passes two sentences through the transformer simultaneously and outputs only a similarity value between 0 and 1. In other words, a cross-encoder is highly accurate at answering "how relevant is this query to this document?" but cannot be used to produce reusable vectors for building a large-scale search index.

Why Split Into Two Stages: Balancing Speed and Precision

If a cross-encoder is more accurate, it might seem logical to apply it to every document from the start, but cost and latency make that impractical. Pinecone sums it up: "rerankers are slow, and retrievers are fast," and backs it with concrete numbers. Reranking 40 million records with a small BERT model on a V100 GPU would take more than 50 hours to answer a single query, whereas vector search finishes in under 100 milliseconds.

The SBERT documentation offers an example in the same vein. Clustering 10,000 sentences with a cross-encoder would require computing similarity for roughly 50 million combinations, taking about 65 hours, whereas obtaining each sentence's embedding with a bi-encoder takes just 5 seconds.
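The figures quoted above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: the throughput values are merely implied by the quoted numbers, and the two sources used different models and hardware, so they are not directly comparable.

```python
from math import comb

sentences = 10_000
pairs = comb(sentences, 2)  # unordered pairs: n * (n - 1) / 2
print(pairs)                # 49995000, i.e. roughly 50 million

# SBERT figure: about 65 hours to score all pairs with a cross-encoder.
pairs_per_second = pairs / (65 * 3600)
print(round(pairs_per_second))  # implied throughput, about 214 pairs per second

# Pinecone figure: 40 million records in a little over 50 hours,
# so this implied rate is an upper bound.
records_per_second = 40_000_000 / (50 * 3600)
print(round(records_per_second))  # about 222 records per second

# At that implied rate, scoring only 100 candidates takes well under a second.
seconds_for_100 = 100 / records_per_second
print(round(seconds_for_100, 2))
```

The last line is the whole argument for two-stage retrieval in miniature: the same per-pair cost that is hopeless across a corpus is acceptable across a short candidate list.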

This is why the standard pattern is two-stage retrieval, which chains the two approaches together. The flow SBERT recommends works like this: first, use an efficient bi-encoder to quickly retrieve the top 100 candidates for the query; then use a cross-encoder to score each (query, candidate) pair and rerank those 100. The bi-encoder casts a wide net to secure recall, and the cross-encoder boosts precision. Elastic makes the same point, noting that semantic reranking "uses a relatively large and complex machine learning model in real time," so it makes sense to apply it as the final pipeline step to a small top-k result set.
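The two-stage flow can be sketched as follows. The first stage is a toy term-frequency ranker and `PhraseScorer` is a hypothetical stand-in for a cross-encoder; the only assumption about the real scorer is that it exposes a `predict` method over (query, document) pairs, as Sentence Transformers' `CrossEncoder` does.

```python
from typing import Protocol, Sequence


class PairScorer(Protocol):
    """Anything with predict(pairs) -> scores, e.g. a cross-encoder wrapper."""

    def predict(self, pairs: Sequence[tuple[str, str]]) -> Sequence[float]: ...


def first_stage(query: str, corpus: list[str], top_k: int) -> list[str]:
    """Cheap, recall-oriented retrieval: rank by query-term frequency (toy)."""
    q_terms = set(query.lower().split())

    def term_hits(doc: str) -> int:
        words = doc.lower().split()
        return sum(words.count(t) for t in q_terms)

    return sorted(corpus, key=term_hits, reverse=True)[:top_k]


def rerank(scorer: PairScorer, query: str, candidates: list[str],
           top_n: int) -> list[tuple[float, str]]:
    """Precision-oriented second stage: score each (query, candidate) pair."""
    scores = scorer.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [(float(score), doc) for score, doc in ranked[:top_n]]


class PhraseScorer:
    """Hypothetical stand-in: rewards documents containing the full query phrase."""

    def predict(self, pairs):
        return [1.0 if q.lower() in d.lower() else 0.1 for q, d in pairs]


corpus = [
    "search engines store search logs and search queries",
    "a cross encoder can rerank search results",
    "cooking pasta takes ten minutes",
]

candidates = first_stage("rerank search", corpus, top_k=2)
print(candidates)  # the keyword-heavy document comes first here
print(rerank(PhraseScorer(), "rerank search", candidates, top_n=1))
```

In this toy run the first stage favors the document that merely repeats "search", while the pair scorer moves the document that actually matches the query to the top. Swapping `PhraseScorer` for a real cross-encoder changes the quality of the scores, not the shape of the pipeline.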

Real-World Impact and Examples

A reranker promotes relevant documents that first-stage retrieval returned but ranked too low; it cannot recover documents that never made it into the candidate set. In Pinecone's example, for a query about RLHF, the single most relevant piece of information sat at position 23 in the first-stage results, but after reranking it rose to position 1. As a result, "much more relevant information" was passed to the LLM and noise was reduced. Pinecone calls reranking "one of the simplest methods for dramatically improving recall performance in RAG or retrieval-based pipelines," typically delivering far more relevant results in exchange for "a few lines of code and a small amount of latency."

Commercial rerankers are also widely used. Cohere's Rerank model "sorts text inputs by their semantic relevance to a query" and is frequently used to re-sort results returned by an existing search solution. Cohere Rerank combines the query tokens and document tokens per request, and models such as rerank-v3.5 have a per-document context limit of 4,096 tokens. When the combined query-plus-document token count exceeds the limit, the document is automatically split into chunks and processed across multiple inferences.
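Re-sorting existing search results through a hosted reranker might look like the sketch below. It assumes a client whose `rerank` method takes `model`, `query`, `documents`, and `top_n`, and returns results carrying an `index` and a `relevance_score`, which mirrors the shape of Cohere's Python SDK; verify against the current SDK reference before relying on it. `FakeClient` is a hypothetical offline stand-in so the example runs without an API key.

```python
from types import SimpleNamespace


def rerank_with_client(client, query: str, documents: list[str], top_n: int = 3,
                       model: str = "rerank-v3.5") -> list[tuple[float, str]]:
    """Call a Cohere-style rerank endpoint and map results back to the texts."""
    response = client.rerank(model=model, query=query,
                             documents=documents, top_n=top_n)
    # Each result is assumed to carry the document's original index and a score.
    return [(r.relevance_score, documents[r.index]) for r in response.results]


class FakeClient:
    """Offline stand-in for illustration only (not the real SDK)."""

    def rerank(self, model, query, documents, top_n):
        words = query.lower().split()
        scored = sorted(
            ((sum(w in doc.lower() for w in words), i)
             for i, doc in enumerate(documents)),
            reverse=True,
        )
        results = [SimpleNamespace(index=i, relevance_score=float(s))
                   for s, i in scored[:top_n]]
        return SimpleNamespace(results=results)


docs = [
    "Paris is the capital of France.",
    "Rerankers sort text by relevance to a query.",
]
print(rerank_with_client(FakeClient(), "relevance to a query", docs, top_n=1))
```

Mapping results back through `index` matters because the service returns positions in the submitted list rather than the texts themselves, so the caller's original ordering must be kept intact until the response arrives.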

Implementation Checklist

  • Retrieve a generous set of candidates with first-stage search (vector or BM25) — for example, the top 50–100 — and then apply the reranker.
  • Apply the reranker only to the top-k candidate set rather than the entire corpus, to keep latency and cost under control.
  • For RAG, place only the top few results after reranking into the LLM context to cut noise and improve accuracy.
  • For multilingual documents, choose a multilingual reranker (for example, Cohere's rerank-multilingual-v3.0 or rerank-v3.5).
  • For long documents, design your chunking strategy with the model's per-document token limit (for example, 4,096) in mind.
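For the long-document item above, a minimal chunking sketch might look like this. It approximates tokens with whitespace-separated words, which is an assumption: a real tokenizer usually produces more tokens than words, so leave headroom or count with the model's own tokenizer. The function name is hypothetical.

```python
def chunk_for_reranker(query: str, document: str,
                       max_tokens: int = 4096) -> list[str]:
    """Split a document so that query + chunk stays within max_tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    budget = max_tokens - len(query.split())
    if budget <= 0:
        raise ValueError("query alone exceeds the token limit")
    words = document.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]


query = "what is a reranker"
document = " ".join(f"w{i}" for i in range(25))

# With a 14-token limit and a 4-word query, each chunk gets 10 words.
chunks = chunk_for_reranker(query, document, max_tokens=14)
print([len(c.split()) for c in chunks])  # [10, 10, 5]
```

Chunking on your own side keeps control over where the splits fall, for instance at paragraph or section boundaries, rather than leaving it to whatever the reranking service does when an input exceeds its limit.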
