RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

Abstract

Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions, and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
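To make the idea of altering a problem in an intermediate symbolic representation concrete, here is a toy sketch. It is an illustration under stated assumptions and not the RE-IMAGINE pipeline itself: the word problem, the `TEMPLATE` string, and the helper names `solve` and `make_variation` are all hypothetical. The point it shows is that once a problem is held as a small program over named values, an input can be changed and the ground-truth answer recomputed, so the new answer cannot be recalled from a memorized original.

```python
import random

# Hypothetical toy problem: a text template plus named input values.
TEMPLATE = (
    "Ava has {a} apples and buys {b} more. "
    "She gives away {c}. How many apples does she have left?"
)


def solve(values):
    """Symbolic form of the problem: the answer is computed, not stored."""
    return values["a"] + values["b"] - values["c"]


def make_variation(values, rng):
    """Change one input value and recompute the ground-truth answer.

    This is an intervention-style edit in the loose sense used here:
    one input is altered and the rest of the problem is left intact.
    """
    mutated = dict(values)  # copy so the original problem is untouched
    key = rng.choice(["a", "b"])  # only increase holdings, so the answer stays non-negative
    mutated[key] += rng.randint(1, 9)
    return TEMPLATE.format(**mutated), solve(mutated)


original = {"a": 5, "b": 3, "c": 2}
rng = random.Random(0)  # seeded for reproducibility
question, answer = make_variation(original, rng)
print(question)
print("ground truth:", answer)
```

Calling `make_variation` repeatedly with different seeds yields as many variants as needed, each with an answer derived by execution rather than by lookup. A real pipeline would also need to translate existing benchmark problems into such a representation and back into natural language, which this sketch does not attempt.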
