

RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

June 18, 2025
作者: Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez
cs.AI

Abstract

Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
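The core idea — altering problems at an intermediate symbolic representation so that "intervention"-level variants cannot be answered by recall — can be illustrated with a toy sketch. This is a hypothetical, minimal illustration and not the paper's actual pipeline: a GSM8K-style word problem is held as a symbolic template plus a value assignment, the ground-truth answer is computed from the symbolic form rather than the text, and an intervention resamples one quantity to produce a fresh variant with a known answer.

```python
import random

# Hypothetical sketch (not RE-IMAGINE's actual implementation):
# a word problem stored symbolically, so interventions on its
# values yield arbitrarily many variants with computable answers.

TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")

def solve(values: dict) -> int:
    """Ground-truth answer derived from the symbolic form, not the text."""
    return values["a"] + values["b"]

def intervene(values: dict, rng: random.Random) -> dict:
    """Intervention-level variant: resample one quantity in the problem."""
    new = dict(values)
    key = rng.choice(["a", "b"])
    new[key] = rng.randint(1, 100)
    return new

rng = random.Random(0)
base = {"name": "Ada", "a": 3, "b": 4}
variant = intervene(base, rng)

print(TEMPLATE.format(**base), "Answer:", solve(base))
print(TEMPLATE.format(**variant), "Answer:", solve(variant))
```

Because the answer is recomputed from the mutated symbolic state, a model that merely memorized the original benchmark item gets no credit on the variant; only applying the underlying reasoning recovers the new answer.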
PDF · June 20, 2025