Microsoft Research Blog



By Rachel Lawrence, Researcher

Ladder of reasoning, reasoning gap, and benchmark synthesis pipeline

“Knowledge is limited. Imagination encircles the world.” – Albert Einstein

Reasoning systems have emerged as a focus of research on language models (LMs), as the field moves beyond surface-level language ability to target deeper cognitive skills. Reasoning, in this context, can be defined as the ability to follow a coherent sequence of steps in order to draw logical inferences, synthesize information, and construct solutions — rather than merely recalling facts or patterns.

The distinction between a coherent reasoning process and “mere recall” raises a core question: Given a language model, can we tell whether it is truly reasoning, or if its performance on math, logic, and coding benchmarks is still indicative only of strong pattern recognition and memorization?1

Part of what makes this question difficult is the way reasoning skills are typically measured. Most contemporary methods for testing reasoning skills in LMs evaluate only the final answer, not the process by which solutions are derived. This creates an evaluation gap, allowing reasoning skills to appear stronger than they truly are. That is, correct answers – particularly on influential, publicly accessible tests such as the GSM8K elementary math benchmark – could also be achieved through statistical recall of the dataset, rather than the desired reasoning pathway.2 By analogy, consider a student who reads the teacher’s answer key before an exam. The student may ace the test, but can we know for sure whether they really learned to think through the concepts?

Although today’s language models are trained on enormous datasets and often demonstrate encyclopedic knowledge, reasoning requires the ability to use prior knowledge and established principles to derive new conclusions. RE-IMAGINE probes exactly this capacity—can an LM rebuild and adapt its solution from first principles when the problem itself is systematically altered?

Climbing the ladder of reasoning

RE-IMAGINE synthesizes new reasoning benchmarks by (1) symbolically mutating the solution processes from existing benchmarks, and (2) asking language models to imagine what would happen if the corresponding aspect of the original problem were changed. This allows RE-IMAGINE to probe process, not just outcome, in the following sense: the mutated problems can all be solved via small modifications to the original solution code, and are designed to be no harder than the original problem to a reasoner using the “correct” strategy – but that same mutated problem would be intractable for any LM which only reproduces patterns from the original answer key without understanding the underlying method.

Identifying reasoning: Instead of using metrics to evaluate the model's answers, use surrogate metrics to evaluate the process used to obtain answers.
An example GSM8K problem and two different modifications at different levels of the ladder of reasoning.

The RE-IMAGINE pipeline synthesizes benchmark problems at three different levels and compares model performance across them, adapting Judea Pearl’s “Ladder of Causation” to the reasoning setting.3 Our new “Ladder of Reasoning” consists of the following hierarchy:

Level 1: Observation

This level captures the accuracy of LMs on existing benchmarks. It is called observation because we expect that models will have already seen similar problems in their training sets, and therefore observational and knowledge-association skills should suffice.

A sample GSM8K problem, represented as natural language, symbolic representation, and computational graph.
A sample problem from the GSM8K benchmark, with no modifications. The symbolic representation and computational graph represent a valid solution method for the problem, but a correct answer to the benchmark does not guarantee that a language model has used this method. Indeed, on a public benchmark like GSM8K, the correct numerical answer may also be observed in online databases.
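
To make the symbolic representation concrete, here is a minimal sketch of what such an executable form could look like, using a hypothetical GSM8K-style problem. The question, numbers, and variable names below are invented for illustration; they are not the sample shown in the figure.

```python
# A hypothetical GSM8K-style word problem, written out as executable Python.
# Question: "A baker makes 3 batches of 12 muffins and sells 20 of them.
# How many muffins are left?"

def solve() -> int:
    batches = 3
    muffins_per_batch = 12
    sold = 20
    baked = batches * muffins_per_batch  # 36 muffins in total
    return baked - sold                  # 16 muffins remain

print(solve())  # 16
```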

Level 2: Mutation

This level captures the ability of LMs to solve problems that have been mutated; for example, by adding irrelevant information, renaming values, or changing numbers. For a robust reasoning model, task performance should not change after these mutations, since they do not affect the difficulty of the (correct) solution process.

Level 2 mutations have been explored by prior work, primarily using hand-written patterns and rules. For example, Mirzadeh et al. (2024)4 and Srivastava et al. (2024)5 used functional templates to create variations of math problems in the GSM8K benchmark. RE-IMAGINE instead generates Level 2 mutations via a symbolic process that eliminates the need for hand-written templates, an advantage explored later in this post.

Level 2 mutation examples
The same GSM8K sample question, now with two different Level 2 mutations applied.
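
As a rough sketch of how such mutations can be generated symbolically rather than from hand-written templates, a “Sample Values”-style mutation could rewrite the constants in the solution code’s abstract syntax tree and re-execute it to obtain the new ground truth. This outline reuses the hypothetical problem from the Level 1 sketch and is not the pipeline’s actual implementation.

```python
import ast
import random

# The original symbolic form of the hypothetical problem from the Level 1 sketch.
ORIGINAL = """
def solve() -> int:
    batches = 3
    muffins_per_batch = 12
    sold = 20
    return batches * muffins_per_batch - sold
"""

class ResampleConstants(ast.NodeTransformer):
    """A 'Sample Values'-style Level 2 mutation: re-sample every integer
    constant while leaving the structure of the solution untouched."""
    def visit_Constant(self, node):
        if isinstance(node.value, int):
            return ast.copy_location(ast.Constant(random.randint(2, 40)), node)
        return node

tree = ast.fix_missing_locations(ResampleConstants().visit(ast.parse(ORIGINAL)))
print(ast.unparse(tree))        # the mutated symbolic form
namespace = {}
exec(compile(tree, "<mutant>", "exec"), namespace)
print(namespace["solve"]())     # ground-truth answer for the new variant
# A real pipeline would also validate the mutant, e.g. rejecting variants
# whose answers make no sense for the story (negative muffins).
```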

Level 3: Imagination

This level captures the models’ ability to incorporate new information and logic into existing problems. Level 3 augments each original problem with an additional logical predicate that changes a previously stated fact. To solve such a problem, a model needs an accurate (explicit or implicit) representation of the solution steps, as well as the ability to contradict and revise the prior knowledge used in those steps.

Testing the ability to envision counterfactual worlds is a unique feature of RE-IMAGINE, building on the work of González and Nori (2024)6.

Level 3 mutation examples
Various Level 3 mutations applied to the GSM8K sample problem. These mutations each ask the responder to consider a revision to a previous statement of the problem.
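
As a minimal sketch, again assuming the hypothetical problem from above, a Level 3 counterfactual can be modeled by overriding a previously stated value in the solution code and re-executing it. The model, however, only sees the original question plus the textual “what if” revision.

```python
# A Level 3 "CounterFactual" mutation on the hypothetical baker problem.
# A previously stated fact is revised, and the mutated solution is executed
# to obtain the new ground-truth answer. (Names are illustrative, not the
# pipeline's API.)

def solve(sold: int = 20) -> int:
    batches = 3
    muffins_per_batch = 12
    baked = batches * muffins_per_batch
    return baked - sold

# The model is shown the original question plus a counterfactual predicate:
# "Suppose the baker had actually sold 28 muffins rather than 20."
# Answering correctly requires revising the earlier fact, not recalling 16.
print(solve(sold=28))  # 8
```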

RE-IMAGINE generates problems at all three levels, allowing us to test and compare models on tasks throughout the reasoning hierarchy.

A synthesis pipeline for reasoning benchmarks

The RE-IMAGINE symbolic benchmark synthesis pipeline works in four parts:

  1. Natural language-to-symbolic translation,
  2. Symbolic mutation,
  3. Symbolic-to-natural language translation, and
  4. Execution.

The first step translates a natural language problem statement into an executable symbolic form, such as a Python code snippet. The second applies a mutation from a user-specified mutation space to change the symbolic representation; for example, modifying the conditions of an if-then statement, adding spurious information, or changing a constant. The third step translates the mutated symbolic representation back to natural language, creating a novel mutated question. Importantly, this step changes based on which level of the reasoning hierarchy is being tested – for Level 3, LMs are presented with the original question and then asked about the effect of applying the change, whereas for Level 2, the change is applied directly to the original problem before it is presented to the model.  The fourth and final step then executes the modified symbolic code to determine the ground-truth answer for this new question.
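
Schematically, the four stages can be pictured as the composition below. The function names and signatures are placeholders chosen for illustration; the post does not describe the pipeline’s actual interfaces.

```python
# A schematic sketch of the four pipeline stages described above.
# Function bodies are placeholders, not the RE-IMAGINE implementation.

def to_symbolic(question: str) -> str:
    """Step 1: LM-assisted translation of a natural-language problem
    into executable Python (the symbolic form)."""
    ...

def mutate(symbolic: str, mutation: str) -> str:
    """Step 2: apply a mutation from the user-specified mutation space,
    e.g. edit an if-condition, add spurious information, change a constant."""
    ...

def to_natural_language(original_question: str, mutated: str, level: int) -> str:
    """Step 3: translate back to natural language. For Level 2 the change is
    folded into the problem statement; for Level 3 the original question is
    kept and the change is posed as a 'what if' revision."""
    ...

def execute(symbolic: str):
    """Step 4: run the mutated code to obtain the ground-truth answer."""
    ...

def synthesize(question: str, mutation: str, level: int):
    symbolic = to_symbolic(question)
    mutated = mutate(symbolic, mutation)
    new_question = to_natural_language(question, mutated, level)
    answer = execute(mutated)
    return new_question, answer
```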

The RE-IMAGINE pipeline for generating reasoning benchmarks

Notably, the auto-translation itself relies on the use of LMs, and care must be taken to ensure correctness. The RE-IMAGINE pipeline includes various safeguards to protect against errors during the translation steps: Validation is performed through back-translation, execution verification, manual review, and consistency checks. These steps ensure that the generated symbolic problems are accurately translated back into natural language, the ground-truth answers are correct, and the logical structure of the problems is maintained.  
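
One way the automated portion of these safeguards could be wired together, reusing the placeholder functions from the sketch above, is outlined here; this is an assumption-laden sketch rather than the pipeline’s actual validation code.

```python
def validate(mutated_symbolic: str, mutated_question: str) -> bool:
    """Automated checks before a generated problem enters the benchmark;
    problems that pass can still be sent to manual review."""
    # Execution verification: the mutated code must run and yield an answer.
    try:
        truth = execute(mutated_symbolic)
    except Exception:
        return False

    # Back-translation consistency: re-translate the generated question into
    # symbolic form and check that it produces the same ground-truth answer.
    try:
        round_trip = to_symbolic(mutated_question)
        if execute(round_trip) != truth:
            return False
    except Exception:
        return False

    return True
```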

Revealing the reasoning gap

Applying RE-IMAGINE testing to commonly used LMs exposes the extent to which these models still struggle to perform tasks beyond Level 1 of the reasoning hierarchy. In particular, Level 3 mutations pose the greatest challenge: accuracy on two-step Level 3 variants falls well below that on six-step Level 1 examples, underscoring the inflated test scores created by benchmarks that rely solely on final-answer correctness.

Initial experiments tested the framework on four widely used benchmarks: GSM8K for math, CLadder for causality, CruxEval for code understanding, and Loop for loop invariant inference. The results indicate a consistent decline in LM performance as reasoning complexity increases across all evaluated benchmarks.7

Model accuracy results on CruxEval and GSM8K.
On the GSM8K benchmark, models show high accuracy on Level 1 problems (“Raw”), but experience a significant drop in performance on Level 2 (“Sample Values”, “UselessInfo”) and Level 3 (“CounterFactual”, “InsertConditional”, “AddDependence”) problems. Similar reductions in accuracy are also observed on problems from the CruxEval benchmark, with each problem variation implemented in both a Level 2 and a Level 3 version.

Problems at higher levels in the reasoning hierarchy, particularly those in Level 3, remain unsolved, with significantly reduced accuracy scores across all evaluated benchmarks and models. These findings highlight the reliance on statistical recall for Level 1 performance, and the challenges LMs face in solving higher-level reasoning tasks.

A scalable solution

The RE-IMAGINE schema introduces a first-of-its-kind scalable mutation generation pipeline that applies across multiple benchmarks and tasks. This framework enables the creation of an arbitrary number of mutations at each level of the hierarchy for existing benchmark problems.

Leveraging symbolic representations of problems such as functional templates (Mirzadeh et al., 2024; Srivastava et al., 2024), reasoning or causal graphs (González & Nori, 2024; Huyuk et al., 2024; Yang et al., 2024), planning tasks (Valmeekam et al., 2022) or code (Li et al., 2024) has become a common strategy for creating problem variations. However, prior approaches to this problem were limited in scope as well as in the level of the reasoning hierarchy they addressed.

In contrast, RE-IMAGINE applies across domains such as math, code, and logic, and for each benchmark, problem variations are created by symbolically altering the solution code, requiring only simple end-user coding to implement new mutations. Through this process, the number of problems generated is limited only by the space of allowed mutations, allowing orders of magnitude higher scaling; in the case of GSM8K, this results in thousands of unique problems.
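
As a back-of-the-envelope sketch of the scaling argument, with illustrative numbers rather than figures from the paper, the count of candidate variants grows multiplicatively with the number of base problems, mutable sites in each solution, and mutation types:

```python
# Illustrative only: mutation types are named after the GSM8K variants in the
# results figure; the per-problem site count is an assumption.
MUTATION_TYPES = [
    "sample_values", "useless_info",                            # Level 2
    "counterfactual", "insert_conditional", "add_dependence",   # Level 3
]

def variant_count(num_problems: int, sites_per_problem: int) -> int:
    return num_problems * sites_per_problem * len(MUTATION_TYPES)

print(variant_count(1_000, 4))  # 20,000 candidate variants before validation
```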

What’s next?

RE-IMAGINE provides a robust method to disentangle genuine reasoning from statistical recall, enabling researchers and users to look critically at claims about reasoning in AI systems. Looking to the future, our recent integration of RE-IMAGINE with the existing EUREKA evaluation framework, along with new directions using synthetic data from the pipeline for reinforcement learning training, could enhance the ability of LMs to handle more complex and dynamic reasoning tasks. With continued advancements toward models with truly generalizable capabilities, we can imagine a world in which AI reasoning is genuinely transformative.


References

  1. Mitchell & Krakauer, 2023
  2. Zhou et al., 2023
  3. Pearl, 2009
  4. Mirzadeh et al., 2024
  5. Srivastava et al., 2024
  6. González & Nori, 2024
  7. GSM8K (Cobbe et al., 2021), CLadder (Jin et al., 2023), CRUXEval (Gu et al., 2024), and Loop (Kamath et al., 2024)
