
RAG

RAG (Retrieval-Augmented Generation) is a technique in which an LLM first retrieves relevant documents from an external knowledge base and then grounds its answer in that retrieved content. Instead of relying solely on the knowledge baked into its parameters, the model pulls in current or specialized material at answer time, which improves accuracy and makes it possible to cite sources, all without retraining.

  • RAG pairs a retrieval step that fetches external documents with a generation step that answers using them as context, supplementing the model's fixed parametric knowledge with external knowledge.
  • The term was coined by Lewis et al. in their 2020 paper (arXiv:2005.11401, NeurIPS 2020), whose original design queries a dense vector index of Wikipedia with a neural retriever.
  • Unlike fine-tuning, RAG reflects new information by updating the knowledge base instead of retraining the model, making it cost-efficient while reducing knowledge-cutoff gaps and hallucination.
  • Because answers are grounded in retrieved documents, RAG can show its sources — exactly how generative search like ChatGPT, Perplexity, and Google AI Overviews cite web pages.
  • From a GEO standpoint, getting your content chosen as an AI's cited source means structuring it so that it survives chunking and embedding intact and is easy to retrieve.

What Is RAG?

RAG (Retrieval-Augmented Generation) is a technique that, before an LLM generates a response, first retrieves documents relevant to the question from an external knowledge base, feeds that content into the model's input (context), and lets the model generate its answer from there. AWS defines RAG as "the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response." In other words, rather than leaning only on what it already knows, the model pulls in trustworthy material at answer time and reasons from it.

This structure matters because of two intrinsic limitations of LLMs. First, a model's knowledge is frozen at training time (the knowledge cutoff), so it has no awareness of later information or of private, internal documents. Second, models hallucinate — fabricating plausible-sounding answers to things they do not know. By injecting external documents as the basis for the answer, RAG mitigates both problems at once and makes it possible to cite which documents were used.

The Origin: Lewis et al. (2020)

The term "RAG" and its methodology come from the 2020 paper by Patrick Lewis and eleven co-authors, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (arXiv:2005.11401, submitted May 22, 2020; accepted at NeurIPS 2020). The paper defines RAG as a language generation model that combines a pre-trained parametric memory with a non-parametric memory. The parametric memory is a pre-trained seq2seq model, while the non-parametric memory is a dense vector index of Wikipedia accessed with a pre-trained neural retriever. Put simply, the core idea of RAG is to use both the knowledge embedded in the model's weights (parametric) and explicitly stored, searchable external knowledge (non-parametric) together.

The paper reported that RAG set state-of-the-art results on three open-domain QA tasks and generated language that was "more specific, diverse and factual" than a seq2seq baseline relying on parametric memory alone. The authors framed the motivation in terms of the limits of pre-trained language models, for which providing "provenance for their decisions and updating their world knowledge remain open research problems." As an aside, NVIDIA's technical blog recounts that lead author Patrick Lewis jokingly apologized for not coming up with a more appealing name than the acronym "RAG."

RAG vs. Fine-Tuning

The two canonical ways to give an LLM domain-specific knowledge are RAG and fine-tuning. They are not substitutes — they serve different purposes, and in practice teams often combine them.

| Dimension | RAG (Retrieval-Augmented Generation) | Fine-Tuning |
| --- | --- | --- |
| How knowledge is injected | Retrieves external documents at inference time and injects them as context | Updates the model's weights themselves during training |
| Keeping it current | Reflected instantly by swapping the knowledge-base documents | Requires retraining on new data |
| Cost | No retraining needed; relatively inexpensive | High cost for training compute and data preparation |
| Source attribution | Can cite the retrieved documents as evidence | Hard to trace sources |
| Hallucination | Mitigated by external grounding | Knowledge is injected, but evidence is weak |
| Best suited for | Current and factual knowledge, internal documents, QA where sources matter | Learning behavior such as tone, format, and domain vocabulary |

In short, RAG is the stronger tool for handling "what the model knows" (knowledge), while fine-tuning excels at "how the model speaks and behaves" (style and format). AWS likewise describes RAG as "a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful" without retraining the model.
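The "keeping it current" difference can be illustrated with a minimal sketch. It assumes a toy retriever that scores documents by word overlap (a stand-in for a real embedding model), and the knowledge-base contents are invented for illustration. The point is only that new information shows up after a document is added, with no model weights involved.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word set; a toy stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, knowledge_base: list[str]) -> str:
    """Return the document that shares the most words with the question."""
    q = tokens(question)
    return max(knowledge_base, key=lambda doc: len(q & tokens(doc)))

# Hypothetical knowledge base; no model is trained or modified anywhere here.
knowledge_base = [
    "The 2023 pricing plan costs 10 dollars per month.",
    "Support is available on weekdays from 9 to 5.",
]

question = "What does the 2024 pricing plan cost?"
before = retrieve(question, knowledge_base)

# New information arrives: update the knowledge base, not the weights.
knowledge_base.append("The 2024 pricing plan costs 12 dollars per month.")
after = retrieve(question, knowledge_base)

print(before)
print(after)
```

Before the update the closest match is the outdated 2023 document; after the update the same question retrieves the 2024 document.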

Components of a RAG Pipeline

A typical RAG pipeline operates through the following stages.

  1. Indexing (preparation): Documents are split into meaningful units (chunking), and each chunk is converted into an embedding vector and stored in a vector database.
  2. Retrieval: The user's question is embedded into the same vector space, and the semantically closest chunks are retrieved (semantic search). When needed, a reranker reorders the candidates by relevance.
  3. Augmentation: The retrieved chunks are merged into the prompt as context and passed to the model.
  4. Generation: The LLM generates its answer based on the injected evidence documents and indicates which ones it referenced.

Tying answers to external evidence this way is called grounding, and RAG is the most common way to implement it.
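The four stages above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production design: the embedding is a bag-of-words count vector standing in for a neural embedding model, the "vector database" is an in-memory list, chunking splits on blank lines, there is no reranker, and the function names (`embed`, `build_index`, `retrieve`, `augment`) are hypothetical. The generation step is left as a prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """1. Indexing: split on blank lines (chunking) and embed each chunk."""
    chunks = [c.strip() for doc in documents for c in doc.split("\n\n") if c.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """2. Retrieval: rank chunks by similarity to the question, keep the top k."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def augment(question: str, chunks: list[str]) -> str:
    """3. Augmentation: merge numbered chunks into the prompt as context."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

# Invented sample documents for illustration.
documents = [
    "RAG retrieves documents before generating an answer.\n\n"
    "Fine-tuning updates the weights of the model during training.",
    "A vector database stores one embedding per chunk.",
]
index = build_index(documents)
question = "How does fine-tuning change the model weights?"
top_chunks = retrieve(question, index, k=2)
prompt = augment(question, top_chunks)

# 4. Generation: this prompt would be sent to an LLM of your choice.
print(prompt)
```

Numbering the sources in the prompt is one simple way to let the model indicate which chunks it relied on; real systems vary in how they request and verify citations.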

Search and GEO Context: AI Cites External Documents

RAG is not just an internal-chatbot technique; it is how generative search itself works today. Answer engines like ChatGPT's web search, Perplexity, and Google AI Overviews retrieve web documents for a user's question, generate an answer grounded in that content, and attach source links. This is essentially "RAG with the entire web as the knowledge base." The core challenge of GEO (Generative Engine Optimization) therefore comes down to one thing: making your content the document that gets retrieved during the AI's retrieval step and cited during its generation step.

In practice, the key is preparing content in a form that AI can easily retrieve and cite. Clear question-and-answer structure, self-contained paragraphs that hold up as standalone units (chunking-friendly structure), explicit facts, figures, and sources, and structured markup all raise both the chance of being retrieved and the likelihood of being cited by AI. Conversely, vague writing with scattered context tends to be missed during retrieval and rarely makes it into AI answers.
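To make "chunking-friendly" concrete, here is a rough sketch of how a heading-aware chunker might treat a page. It is an assumption about how a retrieval system could split text, not a description of how any particular answer engine does it. The function name `chunk_by_heading` and the idea of passing the known headings as a set are hypothetical choices for this example.

```python
def chunk_by_heading(page: str, headings: set[str]) -> list[str]:
    """Split a plain-text page into one chunk per heading section."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in page.splitlines():
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if text in headings:
            sections.append((text, []))  # a heading starts a new section
        else:
            sections[-1][1].append(text)
    # Prefix each chunk with its heading so it stands alone when retrieved.
    return [
        f"{heading}\n{' '.join(body)}" if heading else " ".join(body)
        for heading, body in sections
        if body
    ]

# Invented sample page for illustration.
page = """What Is RAG?

RAG retrieves documents before answering.
It grounds the answer in them.

RAG vs. Fine-Tuning

Fine-tuning changes the weights instead.
"""
headings = {"What Is RAG?", "RAG vs. Fine-Tuning"}
chunks = chunk_by_heading(page, headings)
for chunk in chunks:
    print(chunk)
    print("---")
```

A paragraph that only makes sense alongside its neighbors loses that context once it is split this way, which is why self-contained sections tend to be easier to retrieve.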

Practical Checklist

  • When knowledge changes frequently or source attribution matters, consider RAG before fine-tuning.
  • Chunk documents into meaningful units, and write each chunk so it still makes sense when read on its own.
  • Retrieval quality drives RAG performance, so validate retrieval accuracy with your embedding model and reranker first.
  • Expose the supporting documents alongside generated answers so hallucinations can be verified.
  • For GEO purposes, organize content into question-answer format, self-contained paragraphs, and structured markup to raise the odds of AI retrieval and citation.
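One simple way to validate retrieval quality, as the checklist suggests, is to measure how often the correct chunk appears in the top k results for a set of hand-labeled questions. The sketch below assumes you already have ranked chunk ids from your retriever; the function name `hit_rate_at_k`, the chunk ids, and the questions are all invented for illustration.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of questions whose expected chunk id appears in the top k results."""
    hits = sum(1 for q, gold in expected.items() if gold in results.get(q, [])[:k])
    return hits / len(expected) if expected else 0.0

# Hypothetical retriever output: question -> ranked chunk ids.
results = {
    "What is RAG?": ["c1", "c4", "c2"],
    "When was RAG introduced?": ["c3", "c2", "c1"],
    "RAG or fine-tuning?": ["c4", "c1", "c5"],
    "What is grounding?": ["c2", "c6", "c3"],
}
# Hand-labeled ground truth: question -> id of the chunk that answers it.
expected = {
    "What is RAG?": "c1",
    "When was RAG introduced?": "c2",
    "RAG or fine-tuning?": "c5",
    "What is grounding?": "c7",
}

print(hit_rate_at_k(results, expected, k=1))  # 0.25
print(hit_rate_at_k(results, expected, k=3))  # 0.75
```

Running the same measurement before and after changing the embedding model, chunk size, or reranker gives a like-for-like comparison before any generation quality is assessed.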
