Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation

Abstract

There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than to challenge them. We advocate for developing benchmarks that evaluate this inverse capability: creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for fewer than 9% of incorrect solutions in REFUTE, even though ratings indicate that it can solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions, a capability that is crucial both for accelerating research and for enabling models to self-improve through reliable reflective reasoning.
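To make the idea of automatically evaluating a counterexample by code execution concrete, here is a minimal sketch in Python. It is not REFUTE's actual harness: the toy problem (maximum sum of a non-empty subarray), the input constraints, and every function name (`is_valid_input`, `reference_solution`, `buggy_solution`, `refutes`) are assumptions made for illustration. The sketch treats a candidate input as a successful counterexample when it satisfies the problem's constraints and the incorrect solution's output differs from that of a trusted reference solution.

```python
from typing import Callable, List


def is_valid_input(arr: List[int]) -> bool:
    # Hypothetical constraints for the toy problem: 1..10 elements, each in [-100, 100].
    return 1 <= len(arr) <= 10 and all(-100 <= x <= 100 for x in arr)


def reference_solution(arr: List[int]) -> int:
    # Trusted solution: Kadane's algorithm for the maximum non-empty subarray sum.
    best = cur = arr[0]
    for x in arr[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def buggy_solution(arr: List[int]) -> int:
    # Subtly incorrect: starting from 0 implicitly allows an empty subarray,
    # so the answer is wrong whenever every element is negative.
    best = cur = 0
    for x in arr:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def refutes(
    candidate: List[int],
    buggy: Callable[[List[int]], int],
    reference: Callable[[List[int]], int],
    validator: Callable[[List[int]], bool],
) -> bool:
    # A counterexample must be a legal input AND make the two solutions disagree.
    if not validator(candidate):
        return False
    return buggy(candidate) != reference(candidate)


if __name__ == "__main__":
    print(refutes([-3, -1, -2], buggy_solution, reference_solution, is_valid_input))
    print(refutes([1, 2, 3], buggy_solution, reference_solution, is_valid_input))
```

The all-negative input exposes the bug, while the all-positive input does not. A real evaluation would presumably run submitted programs in a sandbox with time and memory limits instead of calling Python functions directly; that part is omitted here.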
