Microsoft Research Blog



By Rachel Lawrence, Researcher

Ladder of reasoning, reasoning gap, and benchmark synthesis pipeline

“Knowledge is limited. Imagination encircles the world.” – Albert Einstein

Reasoning systems have emerged as a focus of research on language models (LMs), as the field moves beyond surface-level language ability to target deeper cognitive skills. Reasoning, in this context, can be defined as the ability to follow a coherent sequence of steps in order to draw logical inferences, synthesize information, and construct solutions — rather than merely recalling facts or patterns.

The distinction between a coherent reasoning process and “mere recall” raises a core question: Given a language model, can we tell whether it is truly reasoning, or if its performance on math, logic, and coding benchmarks is still indicative only of strong pattern recognition and memorization?1

Part of what makes this question difficult is the way reasoning skills are typically measured. Most contemporary methods for testing reasoning skills in LMs evaluate only the final answer, not the process by which solutions are derived. This creates an evaluation gap, allowing reasoning skills to appear stronger than they truly are. That is, correct answers – particularly on influential, publicly accessible tests such as the GSM8K elementary math benchmark – could also be achieved through statistical recall of the dataset, rather than the desired reasoning pathway.2 By analogy, consider a student who reads the teacher’s answer key before an exam. The student may ace the test, but can we know for sure whether they really learned to think through the concepts?

Although today’s language models are trained on enormous datasets and often demonstrate encyclopedic knowledge, reasoning requires the ability to use prior knowledge and established principles to derive new conclusions. RE-IMAGINE probes exactly this capacity—can an LM rebuild and adapt its solution from first principles when the problem itself is systematically altered?

Climbing the ladder of reasoning

RE-IMAGINE synthesizes new reasoning benchmarks by (1) symbolically mutating the solution processes from existing benchmarks, and (2) asking language models to imagine what would happen if the corresponding aspect of the original problem were changed. This allows RE-IMAGINE to probe process, not just outcome, in the following sense: the mutated problems can all be solved via small modifications to the original solution code, and are designed to be no harder than the original problem to a reasoner using the “correct” strategy – but that same mutated problem would be intractable for any LM which only reproduces patterns from the original answer key without understanding the underlying method.

Identifying reasoning: Instead of using metrics to evaluate the model's answers, use surrogate metrics to evaluate the process used to obtain answers.
An example GSM8K problem and two different modifications at different levels of the ladder of reasoning.

The RE-IMAGINE pipeline synthesizes benchmark problems at three different levels and compares model performance across them, adapting Judea Pearl’s “Ladder of Causation” to the reasoning setting.3 Our new “Ladder of Reasoning” consists of the following hierarchy:

Level 1: Observation

This level captures the accuracy of LMs on existing benchmarks. It is called observation because we expect that models will have already seen similar problems in their training sets, and therefore observational and knowledge-association skills should suffice.

A sample GSM8K problem, represented as natural language, symbolic representation, and computational graph.
A sample problem from the GSM8K benchmark, with no modifications. The symbolic representation and computational graph represent a valid solution method for the problem, but a correct answer to the benchmark does not guarantee that a language model has used this method. Indeed, on a public benchmark like GSM8K, the correct numerical answer may also be observed in online databases.
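
To make the symbolic representation concrete, here is a minimal sketch of what such an executable form could look like, using a hypothetical GSM8K-style problem. The question, numbers, and variable names below are invented for illustration; they are not the sample shown in the figure.

```python
# A hypothetical GSM8K-style word problem, written out as executable Python.
# Question: "A baker makes 3 batches of 12 muffins and sells 20 of them.
# How many muffins are left?"

def solve() -> int:
    batches = 3
    muffins_per_batch = 12
    sold = 20
    baked = batches * muffins_per_batch  # 36 muffins in total
    return baked - sold                  # 16 muffins remain

print(solve())  # 16
```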

Level 2: Mutation

This level captures the ability of LMs to solve problems that have been mutated; for example, by adding irrelevant information, renaming values, or changing numbers. For a robust reasoning model, task performance should not change after these mutations, since they do not affect the difficulty of the (correct) solution process.

Level 2 mutations have been explored by prior work, primarily using hand-written patterns and rules. For example, Mirzadeh et al. (2024)4 and Srivastava et al. (2024)5 used functional templates to create variations of math problems in the GSM8K benchmark. RE-IMAGINE instead generates Level 2 mutations via a symbolic process that eliminates the need for hand-written templates, an advantage explored later in this post.

Level 2 mutation examples
The same GSM8K sample question, now with two different Level 2 mutations applied.
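
As a rough sketch of how such mutations can be generated symbolically rather than from hand-written templates, a “Sample Values”-style mutation could rewrite the constants in the solution code’s abstract syntax tree and re-execute it to obtain the new ground truth. This outline reuses the hypothetical problem from the Level 1 sketch and is not the pipeline’s actual implementation.

```python
import ast
import random

# The original symbolic form of the hypothetical problem from the Level 1 sketch.
ORIGINAL = """
def solve() -> int:
    batches = 3
    muffins_per_batch = 12
    sold = 20
    return batches * muffins_per_batch - sold
"""

class ResampleConstants(ast.NodeTransformer):
    """A 'Sample Values'-style Level 2 mutation: re-sample every integer
    constant while leaving the structure of the solution untouched."""
    def visit_Constant(self, node):
        if isinstance(node.value, int):
            return ast.copy_location(ast.Constant(random.randint(2, 40)), node)
        return node

tree = ast.fix_missing_locations(ResampleConstants().visit(ast.parse(ORIGINAL)))
print(ast.unparse(tree))        # the mutated symbolic form
namespace = {}
exec(compile(tree, "<mutant>", "exec"), namespace)
print(namespace["solve"]())     # ground-truth answer for the new variant
# A real pipeline would also validate the mutant, e.g. rejecting variants
# whose answers make no sense for the story (negative muffins).
```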

Level 3: Imagination

This level captures the models’ ability to incorporate new information and logic into existing problems. Level 3 augments each original problem with an additional logical predicate that changes a previously stated fact. To solve such a problem, a model needs an accurate (explicit or implicit) representation of the solution steps, as well as the ability to contradict and revise the prior knowledge used in those steps.

Testing the ability to envision counterfactual worlds is a unique feature of RE-IMAGINE, building on the work of González and Nori (2024)6.

Level 3 mutation examples
Various Level 3 mutations applied to the GSM8K sample problem. These mutations each ask the responder to consider a revision to a previous statement of the problem.
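
As a minimal sketch, again assuming the hypothetical problem from above, a Level 3 counterfactual can be modeled by overriding a previously stated value in the solution code and re-executing it. The model, however, only sees the original question plus the textual “what if” revision.

```python
# A Level 3 "CounterFactual" mutation on the hypothetical baker problem.
# A previously stated fact is revised, and the mutated solution is executed
# to obtain the new ground-truth answer. (Names are illustrative, not the
# pipeline's API.)

def solve(sold: int = 20) -> int:
    batches = 3
    muffins_per_batch = 12
    baked = batches * muffins_per_batch
    return baked - sold

# The model is shown the original question plus a counterfactual predicate:
# "Suppose the baker had actually sold 28 muffins rather than 20."
# Answering correctly requires revising the earlier fact, not recalling 16.
print(solve(sold=28))  # 8
```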

RE-IMAGINE generates problems at all three levels, allowing us to test and compare models on tasks throughout the reasoning hierarchy.

A synthesis pipeline for reasoning benchmarks

The RE-IMAGINE symbolic benchmark synthesis pipeline works in four parts:

  1. Natural language-to-symbolic translation,
  2. Symbolic mutation,
  3. Symbolic-to-natural language translation, and
  4. Execution.

The first step translates a natural language problem statement into an executable symbolic form, such as a Python code snippet. The second applies a mutation from a user-specified mutation space to change the symbolic representation; for example, modifying the conditions of an if-then statement, adding spurious information, or changing a constant. The third step translates the mutated symbolic representation back to natural language, creating a novel mutated question. Importantly, this step changes based on which level of the reasoning hierarchy is being tested – for Level 3, LMs are presented with the original question and then asked about the effect of applying the change, whereas for Level 2, the change is applied directly to the original problem before it is presented to the model.  The fourth and final step then executes the modified symbolic code to determine the ground-truth answer for this new question.
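
Schematically, the four stages can be pictured as the composition below. The function names and signatures are placeholders chosen for illustration; the post does not describe the pipeline’s actual interfaces.

```python
# A schematic sketch of the four pipeline stages described above.
# Function bodies are placeholders, not the RE-IMAGINE implementation.

def to_symbolic(question: str) -> str:
    """Step 1: LM-assisted translation of a natural-language problem
    into executable Python (the symbolic form)."""
    ...

def mutate(symbolic: str, mutation: str) -> str:
    """Step 2: apply a mutation from the user-specified mutation space,
    e.g. edit an if-condition, add spurious information, change a constant."""
    ...

def to_natural_language(original_question: str, mutated: str, level: int) -> str:
    """Step 3: translate back to natural language. For Level 2 the change is
    folded into the problem statement; for Level 3 the original question is
    kept and the change is posed as a 'what if' revision."""
    ...

def execute(symbolic: str):
    """Step 4: run the mutated code to obtain the ground-truth answer."""
    ...

def synthesize(question: str, mutation: str, level: int):
    symbolic = to_symbolic(question)
    mutated = mutate(symbolic, mutation)
    new_question = to_natural_language(question, mutated, level)
    answer = execute(mutated)
    return new_question, answer
```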

The RE-IMAGINE pipeline for generating reasoning benchmarks

Notably, the auto-translation itself relies on the use of LMs, and care must be taken to ensure correctness. The RE-IMAGINE pipeline includes various safeguards to protect against errors during the translation steps: Validation is performed through back-translation, execution verification, manual review, and consistency checks. These steps ensure that the generated symbolic problems are accurately translated back into natural language, the ground-truth answers are correct, and the logical structure of the problems is maintained.  
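
One way the automated portion of these safeguards could be wired together, reusing the placeholder functions from the sketch above, is outlined here; this is an assumption-laden sketch rather than the pipeline’s actual validation code.

```python
def validate(mutated_symbolic: str, mutated_question: str) -> bool:
    """Automated checks before a generated problem enters the benchmark;
    problems that pass can still be sent to manual review."""
    # Execution verification: the mutated code must run and yield an answer.
    try:
        truth = execute(mutated_symbolic)
    except Exception:
        return False

    # Back-translation consistency: re-translate the generated question into
    # symbolic form and check that it produces the same ground-truth answer.
    try:
        round_trip = to_symbolic(mutated_question)
        if execute(round_trip) != truth:
            return False
    except Exception:
        return False

    return True
```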

Revealing the reasoning gap

Applying RE-IMAGINE testing to commonly used LMs exposes the extent to which these models still struggle to perform tasks beyond Level 1 of the reasoning hierarchy. In particular, Level 3 mutations pose the greatest challenge: accuracy on two-step Level 3 variants falls well below that on six-step Level 1 examples, underscoring the inflated test scores created by benchmarks that rely solely on final-answer correctness.

Initial experiments tested the framework on four widely used benchmarks: GSM8K for math, CLadder for causality, CruxEval for code understanding, and Loop for loop invariant inference. The results indicate a consistent decline in LM performance as reasoning complexity increases across all evaluated benchmarks.7

Model accuracy results on CruxEval and GSM8K.
On the GSM8K benchmark, models show high accuracy on Level 1 problems (“Raw”), but experience a significant drop in performance on Level 2 (“Sample Values”, “UselessInfo”) and Level 3 (“CounterFactual”, “InsertConditional”, “AddDependence”) problems. Similar reductions in accuracy are also observed on problems from the CruxEval benchmark, with each problem variation implemented in both a Level 2 and a Level 3 version.

Problems at higher levels in the reasoning hierarchy, particularly those in Level 3, remain unsolved, with significantly reduced accuracy scores across all evaluated benchmarks and models. These findings highlight the reliance on statistical recall for Level 1 performance, and the challenges LMs face in solving higher-level reasoning tasks.

A scalable solution

The RE-IMAGINE schema introduces a first-of-its-kind scalable mutation generation pipeline that applies across multiple benchmarks and tasks. This framework enables the creation of an arbitrary number of mutations at each level of the hierarchy for existing benchmark problems.

Leveraging symbolic representations of problems such as functional templates (Mirzadeh et al., 2024; Srivastava et al., 2024), reasoning or causal graphs (González & Nori, 2024; Huyuk et al., 2024; Yang et al., 2024), planning tasks (Valmeekam et al., 2022) or code (Li et al., 2024) has become a common strategy for creating problem variations. However, prior approaches to this problem were limited in scope as well as in the level of the reasoning hierarchy they addressed.

In contrast, RE-IMAGINE applies across domains such as math, code, and logic, and for each benchmark, problem variations are created by symbolically altering the solution code, requiring only simple end-user coding to implement new mutations. Through this process, the number of problems generated is limited only by the space of allowed mutations, allowing orders of magnitude higher scaling; in the case of GSM8K, this results in thousands of unique problems.
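
As a back-of-the-envelope sketch of the scaling argument, with illustrative numbers rather than figures from the paper, the count of candidate variants grows multiplicatively with the number of base problems, mutable sites in each solution, and mutation types:

```python
# Illustrative only: mutation types are named after the GSM8K variants in the
# results figure; the per-problem site count is an assumption.
MUTATION_TYPES = [
    "sample_values", "useless_info",                            # Level 2
    "counterfactual", "insert_conditional", "add_dependence",   # Level 3
]

def variant_count(num_problems: int, sites_per_problem: int) -> int:
    return num_problems * sites_per_problem * len(MUTATION_TYPES)

print(variant_count(1_000, 4))  # 20,000 candidate variants before validation
```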

What’s next?

RE-IMAGINE provides a robust method to disentangle genuine reasoning from statistical recall, enabling researchers and users to look critically at claims about reasoning in AI systems. Looking to the future, our recent integration of RE-IMAGINE with the existing EUREKA evaluation framework, along with new directions using synthetic data from the pipeline for reinforcement learning training, could enhance the ability of LMs to handle more complex and dynamic reasoning tasks. With continued advancements toward models with truly generalizable capabilities, we can imagine a world in which AI reasoning is genuinely transformative.


References

  1. Mitchell & Krakauer, 2023
  2. Zhou et al., 2023
  3. Pearl, 2009
  4. Mirzadeh et al., 2024
  5. Srivastava et al., 2024
  6. González & Nori, 2024
  7. GSM8K (Cobbe et al., 2021), CLadder (Jin et al., 2023), CRUXEval (Gu et al., 2024), and Loop (Kamath et al., 2024)
