Wei 2022 - CoT Prompting in LLMs

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This is an early paper that sparked the explosion of research on in-context learning for LLMs. The main idea is to include Chain-of-Thought (CoT) reasoning in the few-shot examples, i.e. <Question>, <CoT reasoning>, <Answer>, as opposed to just the question and answer alone, <Question>, <Answer> (as per Brown 2020 - Language Models are Few-Shot Learners).

Method

The method is simple: we include the CoT reasoning in the few-shot examples in the prompt, inducing the LLM to also generate CoT reasoning before answering the question at inference time. An example prompt:

Q: Roger has 3 balls. He buys 2 more cans of balls, each with 3 balls.
How many balls does he have now?

A: Roger started with 3 balls. 2 cans of 3 balls each is 6 balls.
3 + 6 = 9. The answer is 9.
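
As a concrete illustration, here is a minimal sketch of how such a prompt could be assembled in Python. The exemplar list, the build_cot_prompt helper, and the complete function it would be fed to are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of few-shot CoT prompting. The exemplar list, build_cot_prompt,
# and the `complete` call are illustrative assumptions, not the paper's code.
COT_EXEMPLARS = [
    {
        "question": (
            "Roger has 3 balls. He buys 2 more cans of balls, each with 3 balls. "
            "How many balls does he have now?"
        ),
        "reasoning": "Roger started with 3 balls. 2 cans of 3 balls each is 6 balls. 3 + 6 = 9.",
        "answer": "9",
    },
    # ... additional <Question, CoT reasoning, Answer> exemplars ...
]


def build_cot_prompt(new_question: str) -> str:
    """Concatenate <Question, CoT reasoning, Answer> exemplars, then append the new question."""
    parts = []
    for ex in COT_EXEMPLARS:
        parts.append(f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}.")
    parts.append(f"Q: {new_question}\nA:")  # the model continues with reasoning, then the answer
    return "\n\n".join(parts)


# Usage (assuming some LLM completion function `complete(prompt: str) -> str` exists):
# answer_text = complete(build_cot_prompt("A farm has 4 pens with 5 pigs each. How many pigs in total?"))
```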

What are some advantages of CoT prompting?

  • Usable on black-box LLMs. No fine-tuning is required, so it can be readily applied to off-the-shelf LLMs.
  • CoT allows decomposition of complex problems into intermediate steps. The conceptual idea is that this allows the model to offload additional computation to the intermediate tokens, analogous to human step-by-step reasoning.
  • CoT reasoning offers some interpretability of model behaviour. We get a suggestive trace of how the model arrived at its answer (though it would be naive to assume that the LLM uses the CoT trace exactly the way human logic operates).

Observations

An important sub-finding is that the improvement of CoT prompting over standard prompting (as per Brown 2020) grows significantly as we:

  • Increase the question difficulty (which seems intuitive); and
  • Increase the model size. At the 8B parameter scale, the study showed no difference between CoT prompting and standard prompting, but the gap widened significantly at the ~100B scale and further still at the ~500B scale.

Another interesting ablation study tried to isolate the improvements to a specific aspect of prompting:

  • Equation only. Here the exemplars contain only the equation, e.g. 3 + 2 * 3 = 9 for the above example. This helped on datasets whose questions can be translated into an equation in a few simple steps, but was no better than standard prompting on GSM8K, which requires more semantic parsing of the question.
  • Variable compute only. One may argue that the exact tokens generated in the intermediate step do not matter; all that matters is the additional compute the model performs while generating them. To test this, the authors prompt the model to output a sequence of dots ... in place of the reasoning step. This does not help.
  • CoT after answer. Another hypothesis is that merely including CoT traces in the prompt improves in-context learning by itself, meaning the intermediate tokens do not actually need to be generated before the answer. The authors test this by placing the CoT reasoning after the answer in the exemplars, i.e. <Question>, <Answer>, <CoT reasoning>, which forces the LLM to generate the answer before the CoT trace. This also does not help.

Thus the ablation studies help to clarify that it is the intermediate natural language reasoning steps that help the model offload computation and improve the accuracy of its answers.
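
To make the ablation conditions concrete, here is a hedged sketch of what the three exemplar formats could look like for the Roger example. The exact wording and dot count used in the paper may differ; this only illustrates the structure of each variant.

```python
# Hedged sketch of the three ablation exemplar formats, using the Roger example.
# The exact wording and dot count in the paper may differ; this only shows the structure.

QUESTION = (
    "Q: Roger has 3 balls. He buys 2 more cans of balls, each with 3 balls. "
    "How many balls does he have now?\n"
)
REASONING = "Roger started with 3 balls. 2 cans of 3 balls each is 6 balls. 3 + 6 = 9."

# 1. Equation only: the exemplar answer is just the arithmetic expression.
equation_only = QUESTION + "A: 3 + 2 * 3 = 9\n"

# 2. Variable compute only: dots replace the reasoning, roughly matching its length,
#    so the model spends extra compute on intermediate tokens without any semantics.
variable_compute_only = QUESTION + "A: " + "." * len(REASONING) + " The answer is 9.\n"

# 3. CoT after answer: the answer is emitted first, so the reasoning tokens
#    cannot influence the answer at generation time.
cot_after_answer = QUESTION + "A: The answer is 9. " + REASONING + "\n"
```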

Results

The performance gain of CoT few-shot prompting over standard few-shot prompting is striking (GSM8K solve rate):

  • Using GPT-3 175B, the solve rate increases from 15.6% to 46.9%
  • Using PaLM 540B, the solve rate increases from 17.9% to 56.9%
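
The numbers above are solve rates, i.e. the fraction of test questions whose final answer matches the reference. A simple way to score CoT completions of the form above is to extract the number after "The answer is"; the extract_final_answer helper below is an assumed evaluation detail for illustration, not the paper's code.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Pull the number following 'The answer is' out of a CoT completion."""
    match = re.search(r"The answer is\s*(-?[\d,\.]+)", completion)
    if not match:
        return None
    return match.group(1).rstrip(".").replace(",", "")


# extract_final_answer("Roger started with 3 balls. ... 3 + 6 = 9. The answer is 9.")  ->  "9"
# Solve rate = fraction of questions whose extracted answer equals the reference answer.
```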