

RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

June 18, 2025
作者: Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez
cs.AI

Abstract

Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
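To make the mutation idea concrete, here is a minimal, hypothetical Python sketch of what "altering problems in an intermediate symbolic representation" could look like for a GSM8K-style math word problem. The template, the `answer` program, and the `intervene` rule are illustrative assumptions in the spirit of the abstract, not the authors' released RE-IMAGINE pipeline: a problem is reduced to named quantities plus an answer program, and an intervention-style mutation resamples a quantity and re-derives the ground truth, so each variant cannot be answered by recalling a memorized (question, answer) pair.

```python
# Hypothetical sketch of symbolic benchmark mutation (illustrative only;
# not the paper's actual implementation).
import random

# Surface form of the problem; {a} and {b} are symbolic quantities.
TEMPLATE = "Ann has {a} apples and buys {b} more. How many apples does she have now?"

def answer(values: dict) -> int:
    # Ground-truth program for the template above.
    return values["a"] + values["b"]

def intervene(values: dict, rng: random.Random) -> dict:
    # Intervention-style mutation (assumed): resample one quantity, then
    # re-derive the answer from the symbolic form, defeating pure recall.
    key = rng.choice(list(values))
    return dict(values, **{key: rng.randint(2, 99)})

if __name__ == "__main__":
    rng = random.Random(0)
    base = {"a": 3, "b": 4}
    print(TEMPLATE.format(**base), "->", answer(base))
    for _ in range(3):
        variant = intervene(base, rng)
        print(TEMPLATE.format(**variant), "->", answer(variant))
```

Because the mutation operates on the symbolic form rather than the text, arbitrarily many variants with verified answers can be generated, and the same pattern extends to code and logic problems by swapping in a different template and answer program.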