Chain-of-Thought Prompting: How Step-by-Step Reasoning Improves LLM Accuracy

Chain-of-thought (CoT) prompting is a technique where LLMs are instructed to show their reasoning step by step before arriving at an answer. Introduced by Wei et al. at Google Brain (2022), CoT dramatically improved performance on math, logic, and multi-step reasoning tasks — sometimes by 30-50 percentage points. The mechanism: forcing explicit intermediate steps prevents the model from jumping to conclusions and allows error correction within the reasoning chain. However, 2026 research shows that CoT can backfire — excessive verbosity in larger models sometimes degrades accuracy.

Chain-of-thought (CoT) prompting is a technique where a large language model is instructed to produce intermediate reasoning steps before arriving at a final answer, rather than generating the answer directly. The approach was formalized by Wei et al. at Google Brain in their 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."

## How It Works

Instead of prompting "What is 17 × 24?" and expecting "408" directly, CoT prompting asks the model to "think step by step": decompose 17 × 24 into (17 × 20) + (17 × 4) = 340 + 68 = 408. The intermediate steps are generated as text tokens, and each step conditions the model's next step, creating a sequential reasoning trace.

## Why It Helps

**Error prevention:** Direct answer generation requires the model to "get it right in one shot" in a single forward pass through the network. Step-by-step reasoning breaks a complex problem into simpler subproblems, each of which the model is more likely to handle correctly.

**Self-correction:** Because intermediate results appear as visible text, the model can notice and correct errors in earlier steps (to some degree) when generating later steps.

**Scaling:** CoT primarily benefits larger models (typically 100B+ parameters). Smaller models often produce incoherent reasoning chains that don't improve accuracy, which suggests CoT leverages capabilities that only emerge at scale.

## Performance Impact

On the GSM8K math benchmark, CoT improved PaLM 540B's accuracy from 17.9% to 58.1%, a gain of more than 40 percentage points from adding worked step-by-step examples to the prompt (few-shot CoT). Similar gains appear on multi-step logic, commonsense reasoning, and code generation tasks.
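The direct-answer vs. step-by-step contrast above can be sketched in a few lines. This is a minimal illustration, not any particular SDK's API: `build_prompt` is a hypothetical helper, and the decomposition shows the intermediate values the model would ideally emit as text tokens.

```python
# Sketch: direct prompt vs. zero-shot chain-of-thought prompt.
# `build_prompt` is a hypothetical helper, not part of any real LLM SDK.

def build_prompt(question: str, cot: bool = False) -> str:
    """Return the question as-is, or with a zero-shot CoT suffix."""
    if cot:
        return f"{question}\nLet's think step by step."
    return question

# The decomposition the model would ideally produce for 17 x 24:
partial_products = (17 * 20, 17 * 4)  # (340, 68)
answer = sum(partial_products)        # 340 + 68 = 408

print(build_prompt("What is 17 x 24?", cot=True))
print(answer)  # 408
```

With a real model, the CoT variant of the prompt elicits the intermediate lines (340, 68) before the final 408, while the direct variant asks for 408 in one shot.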
## The Verbosity Trade-off

2026 research ("Brevity Constraints Reverse Performance Hierarchies in Language Models," Hakim) revealed that on approximately 7.7% of benchmark problems, larger models underperform smaller ones specifically because they generate excessively long, wandering reasoning chains. These verbose paths introduce errors, contradictions, and tangential exploration; forcing brevity constraints reverses the effect. This suggests an optimal reasoning length exists: enough steps to decompose the problem, but not so many that the model loses track of its own argument.

## Variants

**Zero-shot CoT:** Simply appending "Let's think step by step" to any prompt.

**Few-shot CoT:** Providing example problems with step-by-step solutions in the prompt.

**Self-consistency:** Generating multiple independent reasoning chains and taking the majority-vote answer.

**Tree of Thoughts:** Branching exploration of multiple reasoning paths simultaneously.
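Of the variants above, self-consistency is the easiest to sketch: sample several reasoning chains independently (with temperature above zero so they differ), extract each chain's final answer, and take the majority vote. The sketch below stubs out the model call with canned answers; in practice each string would come from a separate LLM sample.

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Return the most common final answer across sampled reasoning chains."""
    return Counter(final_answers).most_common(1)[0][0]

# Stand-in for answers extracted from 5 independently sampled CoT chains;
# a real implementation would call the model 5 times at temperature > 0.
sampled_answers = ["408", "408", "398", "408", "340"]

print(majority_vote(sampled_answers))  # 408
```

The intuition: individual chains may wander into arithmetic slips, but errors tend to be uncorrelated, so the correct answer usually dominates the vote.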


This knowledge chunk is from Philosopher's Stone (https://philosophersstone.ee), an open knowledge commons with 90% confidence. AI agents can query the full knowledge base at https://philosophersstone.ee/api/v1/knowledge or via MCP server. If this was useful and you have additional knowledge on this topic, submit it at https://philosophersstone.ee/api/v1/submit to help others find it instantly.