Chain of Thought
Chain of Thought (CoT) is a prompting technique that boosts complex reasoning in large language models by getting them to spell out intermediate reasoning steps before committing to a final answer. Instead of jumping straight to the result, the model works through the problem, which sharply improves accuracy on arithmetic, commonsense, and symbolic tasks.
- Chain of Thought (CoT) raises a model's accuracy on hard reasoning tasks by prompting it to write out intermediate steps before stating the final answer.
- The two common variants are few-shot CoT, which shows the model a handful of worked examples, and zero-shot CoT, which simply appends the phrase "Let's think step by step."
- Wei et al. (2022, arXiv:2201.11903) pushed PaLM 540B to roughly 58% on the GSM8K math benchmark using just 8 examples, beating a fine-tuned GPT-3 with a verifier (55%).
- The effect is emergent: it only kicks in on large models, typically those above ~100B parameters.
- For GEO and AI search, CoT explains the reasoning path answer engines follow when they cite and summarize, and clearly structured, step-by-step content is easier to quote.
Overview
Chain of Thought (CoT) is a prompting technique that nudges a large language model (LLM) to write out its intermediate reasoning one step at a time rather than blurting out the answer. It mirrors how a person tackles a hard problem by working through it on paper. Forcing the model to show its work produces a noticeable jump in accuracy on tasks that require chaining several steps together, including arithmetic, commonsense, and symbolic reasoning.
The key point is that nothing about the model's weights changes. This is not fine-tuning; only the input prompt changes. Because it unlocks reasoning ability through prompt design alone, with no extra training cost, CoT became the starting point for the many reasoning-enhancement methods and "reasoning models" that followed.
How It Works
Chain of Thought is implemented in two main ways.
Few-Shot CoT
This is the original form proposed by Wei et al. (2022). You place a few exemplars in the prompt, each laid out as "problem → step-by-step solution → answer," which steers the model to unfold its own solution along the same pattern.
Q: A cafe has 23 apples. If it uses 20 at lunch and buys 6 more, how many apples does it have now?
A: It started with 23 apples. It used 20, so 23 - 20 = 3 are left.
It then bought 6 more, so 3 + 6 = 9. The answer is 9 apples.
Q: There are 3 cars in the lot and 2 more pull in. How many cars are there now?
A:
The natural-language solution in the "A:" portion of each example is the chain of thought. Having seen this pattern, the model generates its reasoning first on a new problem instead of just emitting the answer.
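The prompt layout above can be assembled programmatically. The following is a minimal sketch, not the authors' code: the `Exemplar` class, the `build_few_shot_cot_prompt` helper, and the "The answer is ..." closing phrase are illustrative assumptions chosen to match the example above.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """One worked example (hypothetical structure, for illustration only)."""
    question: str
    reasoning: str  # the natural-language chain of thought
    answer: str


def build_few_shot_cot_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Lay out each exemplar as Q / A (steps, then answer), then pose the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # End on an open "A:" so the model continues with its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


apple_example = Exemplar(
    question=(
        "A cafe has 23 apples. If it uses 20 at lunch and buys 6 more, "
        "how many apples does it have now?"
    ),
    reasoning=(
        "It started with 23 apples. It used 20, so 23 - 20 = 3 are left. "
        "It then bought 6 more, so 3 + 6 = 9."
    ),
    answer="9 apples",
)

prompt = build_few_shot_cot_prompt(
    [apple_example],
    "There are 3 cars in the lot and 2 more pull in. How many cars are there now?",
)
print(prompt)
```

The resulting string would be sent to whichever model API is in use. Keeping the exemplars as data rather than as one hand-edited string makes it easier to swap or reorder them.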
Zero-Shot CoT
Kojima et al. (2022) showed that much of this effect can be obtained with no examples at all, simply by appending the single phrase "Let's think step by step" after the question. Because there are no worked examples to author, it is trivial to apply.
Q: There are 16 juggling balls. Half are golf balls, and half of those golf balls are blue. How many blue golf balls are there?
A: Let's think step by step.
One important caveat: an ablation by Wei et al. found that what actually drives the gain is the natural-language intermediate steps themselves. Merely producing more output tokens, or tacking an explanation on after the answer, did not yield the same improvement.
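As a rough sketch of how zero-shot CoT might be wired up, the code below runs two stages: one call to elicit the reasoning, and a second call that feeds the reasoning back and asks for the final answer. The `generate` callable is a hypothetical stand-in for any text-completion function, `fake_generate` returns canned text so the example runs offline, and the exact wording of the answer-extraction phrase should be treated as an assumption modeled on the arithmetic setting in Kojima et al.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
# Assumed wording for the second-stage prompt (arithmetic tasks).
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> tuple[str, str]:
    """Return (reasoning, answer) from a two-stage zero-shot CoT exchange."""
    # Stage 1: the trigger phrase prompts the model to write out its steps.
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(reasoning_prompt).strip()
    # Stage 2: append the reasoning and ask for the final answer only.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = generate(answer_prompt).strip()
    return reasoning, answer


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; returns canned text."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 4."
    return (
        "There are 16 balls. Half are golf balls, so 16 / 2 = 8. "
        "Half of those are blue, so 8 / 2 = 4."
    )


reasoning, answer = zero_shot_cot(
    "There are 16 juggling balls. Half are golf balls, and half of those "
    "golf balls are blue. How many blue golf balls are there?",
    fake_generate,
)
print(reasoning)
print(answer)
```

Passing the model call in as a parameter keeps the sketch independent of any particular provider's API.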
Standard Prompting vs Chain of Thought
| Aspect | Standard prompting | Few-shot CoT | Zero-shot CoT |
|---|---|---|---|
| Examples in the prompt | Optional, answers only | Yes, with step-by-step solutions | None |
| Extra input | None | Hand-written solution examples | The phrase "Let's think step by step" |
| Intermediate steps in output | No | Yes | Yes |
| Ease of use | Easiest | Requires designing examples | Very easy |
| Key source | — | Wei et al. 2022 | Kojima et al. 2022 |
Evidence and Numbers
The benefit of chain of thought is quantified in two foundational papers.
- Few-shot CoT (Wei et al., 2022): Giving PaLM 540B just 8 chain-of-thought exemplars lifted its accuracy on GSM8K, a benchmark of grade-school math word problems, to roughly 58%, a state-of-the-art result at the time. That beat a GPT-3 model fine-tuned with a verifier, which reached 55%. The Google Research blog noted that combining CoT with the follow-up technique self-consistency raised the score on the same benchmark to 74%.
- Zero-shot CoT (Kojima et al., 2022): Adding only "Let's think step by step" to text-davinci-002 (InstructGPT) raised MultiArith accuracy from 17.7% to 78.7% and GSM8K accuracy from 10.4% to 40.7%. The striking part is that this gain came from a single sentence, with not one example.
Chain of thought does not, however, work on every model. Wei et al. and Google Research describe the benefit as an emergent property of model scale that appears only at roughly 100B parameters and above. Smaller models tended to produce illogical solutions and sometimes scored worse than with standard prompting.
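Self-consistency, mentioned above, samples several reasoning chains for the same question and keeps the answer that appears most often. The sketch below shows only the voting step, under simplifying assumptions: the completions are hard-coded rather than sampled from a model, and the final answer is taken to be the last number in each completion. The helper names are hypothetical.

```python
import re
from collections import Counter
from typing import Optional


def extract_final_number(completion: str) -> Optional[str]:
    """Return the last number in a completion, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def self_consistent_answer(completions: list[str]) -> Optional[str]:
    """Majority vote over the final answers of several reasoning chains."""
    answers = [a for a in map(extract_final_number, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hard-coded stand-ins for completions sampled at a nonzero temperature.
samples = [
    "23 - 20 = 3, then 3 + 6 = 9. The answer is 9.",
    "23 + 6 = 29, then 29 - 20 = 9. The answer is 9.",
    "23 - 20 = 3. The answer is 3.",
]
print(self_consistent_answer(samples))  # prints 9
```

Two of the three chains agree on 9, so the vote discards the chain that stopped one step early.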
Relevance to GEO and AI Search
Generative answer engines such as ChatGPT, Perplexity, and Google's AI Overviews can work through intermediate reasoning steps, much like chain of thought, when answering a user's question. As a result, content that makes steps, evidence, and cause-and-effect explicit (numbered procedures, a definition-evidence-conclusion flow, tables) is easier for these engines to reason over and cite. Content with a clear reasoning path is more likely to be picked up for citation and summarization than vague prose.
Sources
- Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
- Kojima, T. et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022, arXiv:2205.11916
- Wei, J. & Zhou, D. (2022). Language Models Perform Reasoning via Chain of Thought. Google Research Blog