Context Window

A context window is the maximum span of input and output tokens a large language model (LLM) can reference together in a single request. It functions as the model's working memory, and anything beyond this limit gets truncated or goes unprocessed.

A context window is the maximum capacity of input plus output tokens a model can handle in one pass, and the prompt, conversation history, attached documents, and generated answer all have to fit inside it.
Limits vary widely by model: GPT-4o supports 128K, standard Claude supports 200K, the GPT-4.1 family supports 1M, and Gemini 1.5 Pro supports 2M tokens.
A larger window does not automatically mean better performance, because accuracy and recall tend to degrade as the token count grows, a phenomenon known as context rot.
The "Lost in the Middle" study (arXiv:2307.03172) shows a U-shaped curve in which performance peaks when key information sits at the very start or end of the input and drops sharply when it falls in the middle.
The context window is a measure of capacity. Deciding what to put into that finite space, and how, belongs to context engineering, a related but separate discipline.

What Is a Context Window

A context window is the maximum span of text a large language model can draw on when generating a response. The system prompt, the user's question, prior conversation history, any attached documents, and the answer the model itself produces all fall within this span. IBM describes it as the amount of text a model can take into account at once as it processes input and generates output.

The unit here is the token, not the word. A token is a fragment that the model's tokenizer splits text into. In English, one token corresponds to about four characters on average, while Korean tends to be split into smaller pieces at the character or morpheme level, so the same sentence can yield a higher token count. A 128K-token context window means a single request can hold at most about 128,000 tokens of input and output combined.
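To make the token arithmetic concrete, here is a minimal Python sketch that estimates token counts with the four-characters-per-token rule of thumb and checks whether a request fits a given window. The function names and the reserved-output parameter are illustrative assumptions, and the heuristic applies only to English text; a real tokenizer from the model provider will give different counts.

```python
import math

# Assumption: rough English-only heuristic (about 4 characters per token).
# This is NOT a real tokenizer; actual counts vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for English text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_window(parts: list[str], reserved_output_tokens: int,
                   window_tokens: int) -> bool:
    """Check whether all input parts plus the reserved output fit the window.

    Input and output share the same window, so the space set aside for the
    answer is added to the estimated input size before comparing.
    """
    input_tokens = sum(estimate_tokens(p) for p in parts)
    return input_tokens + reserved_output_tokens <= window_tokens


# Hypothetical request: system prompt, question, and one large document.
system_prompt = "You are a helpful assistant."
question = "Summarize the attached report in three bullet points."
report = "x" * 400_000  # stand-in for a 400,000-character document

print(estimate_tokens(report))
print(fits_in_window([system_prompt, question, report], 16_000, 128_000))
```

Because the answer also consumes window space, reserving room for output up front avoids a request that fits on input but leaves too little space for the response.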

Critically, the context window is a different concept from the vast knowledge a model has learned. If training data is the model's long-term knowledge, the context window is closer to a "working memory" that operates only for the request at hand. In its official documentation, Anthropic likens the context window to the model's working memory, explaining that a larger window lets the model handle more complex and longer prompts.

Context Window Length of Major Models

The table below lists context window limits for representative models, verified against official documentation and announcements. Figures can change depending on the model version and serving environment, so for real-world use, check each provider's latest model comparison.

Model	Provider	Context Window	Notes
GPT-4o	OpenAI	128,000 tokens	Maximum output around 16K tokens
GPT-4.1 / mini / nano	OpenAI	1,000,000 tokens	Major expansion over GPT-4o (128K)
Claude (standard, e.g., Sonnet 4.5)	Anthropic	200,000 tokens	Some higher-tier models support 1M tokens
Gemini 1.5 Pro	Google	2,000,000 tokens	Longest context at the time of announcement
Gemini 1.5 Flash	Google	1,000,000 tokens	Lightweight, high-speed version

In its GPT-4.1 announcement, OpenAI stated that this model family can process roughly 750,000 words (about 3,000 pages), and Google explained that Gemini 1.5 Pro's 2 million tokens amount to about 19 hours of audio or thousands of pages of text. In short, the size of the context window determines how much material you can feed in all at once.
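As a rough illustration of how these limits work as a filter, the sketch below takes the figures from the table and returns the models whose window can hold a request of a given size. The dictionary name and function are hypothetical, and the numbers are only as current as the table, so they should be rechecked against each provider's documentation.

```python
# Limits copied from the table above (in tokens). Assumption: these values
# may be out of date; treat them as placeholders, not authoritative figures.
CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "GPT-4.1": 1_000_000,
    "Claude (standard)": 200_000,
    "Gemini 1.5 Pro": 2_000_000,
    "Gemini 1.5 Flash": 1_000_000,
}


def models_that_fit(total_tokens: int) -> list[str]:
    """Return the models whose context window can hold total_tokens.

    total_tokens should cover input and expected output combined.
    """
    return [name for name, limit in CONTEXT_WINDOWS.items()
            if total_tokens <= limit]


# A request of about 150,000 tokens exceeds the 128K window only.
print(models_that_fit(150_000))
# ['GPT-4.1', 'Claude (standard)', 'Gemini 1.5 Pro', 'Gemini 1.5 Flash']
```

Fitting inside the window is only a capacity check; it says nothing about how well the model will use what is placed there.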

Is a Bigger Window Always Better? Limits and Evidence

Intuitively a larger window seems better, but in practice that is not always the case. In its official documentation, Anthropic uses the term context rot for the degradation in accuracy and recall as token counts rise, stressing that what you include matters as much as how much you include.

The landmark study that demonstrated this limitation empirically is Liu et al. (2023), Lost in the Middle: How Language Models Use Long Contexts (arXiv:2307.03172, published in TACL). Using multi-document question answering and key-value retrieval tasks, the paper observed that performance is highest when the information needed for the answer sits at the start or end of the input, and drops sharply when it is positioned in the middle. As the location of the information was shifted, performance traced a U-shaped curve, high at both ends and low in the center, and this pattern held even for models marketed as long-context.

In sum, a larger context window only raises the maximum you can include; it offers no guarantee that the model will make equally good use of every piece of information inside it. In practice, then, rather than filling the window indiscriminately, it is more effective to place key information in favorable positions such as the beginning or end, or to trim away what is unnecessary. The work of designing what to place into this finite window and in what order is handled as a separate concept, context engineering. If the context window is the size of the bowl, context engineering is the question of what to put in that bowl and how.
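One way to act on the U-shaped pattern is to reorder retrieved documents so the most relevant ones sit at the edges of the prompt. The Python sketch below is one possible heuristic under that assumption, not a method prescribed by the paper; the function name is hypothetical, and it expects the input to be already sorted with the most relevant document first.

```python
def place_at_edges(docs_by_relevance: list[str]) -> list[str]:
    """Reorder documents so the most relevant land at the start and end.

    The input must be sorted most relevant first. Items are dealt
    alternately to the front and the back of the output, which pushes
    the least relevant items toward the middle of the sequence.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(doc)    # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]


# "A" is the most relevant document, "E" the least.
ranked = ["A", "B", "C", "D", "E"]
print(place_at_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Here the two strongest documents, A and B, end up first and last, while the weakest, E, lands in the center where the study found performance to be lowest. Whether this ordering helps for a particular model and task would need to be measured.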

Related terms

Context Engineering