大規模言語モデルを用いた不等式証明の解法

要旨

不等式証明は、多様な科学および数学分野において重要な役割を果たし、厳密な境界の発見や戦略的な定理の適用といった高度な推論スキルを試すものである。これにより、大規模言語モデル（LLMs）にとって、一般的な数学的問題解決を超えた洞察を提供する、独特で要求の厳しいフロンティアとなっている。この分野の進展は、既存のデータセットがしばしば不足している、合成的である、または厳密に形式的であることによって妨げられている。我々は、この問題に対処するために、非形式的でありながら検証可能なタスク定式化を提案し、不等式証明を自動的にチェック可能な2つのサブタスク、すなわち境界推定と関係予測に再構築する。これに基づいて、オリンピアドレベルの不等式を含む専門家がキュレートしたデータセットIneqMathを公開し、段階的な解法と定理注釈を充実させたテストセットとトレーニングコーパスを提供する。さらに、最終回答の判定者と、一般的な推論の欠陥を検出するために設計された4つの段階的な判定者を組み合わせた、新しいLLM-as-judge評価フレームワークを開発する。IneqMathにおける29の主要なLLMsの体系的な評価は、驚くべき現実を明らかにする：o1のようなトップモデルでさえ、段階的な精査の下では全体の精度が10%未満であり、これは最終回答の等価性のみを考慮した場合の精度から最大65.5%の低下である。この不一致は、脆弱な演繹的連鎖と、現在のLLMsが単に答えを見つけることと厳密な証明を構築することとの間に存在する重要なギャップを暴露する。モデルサイズの拡大やテスト時の計算量の増加は、全体の証明の正確性において限定的な改善しかもたらさない。代わりに、我々の調査結果は、定理に基づく推論や自己改善といった有望な研究方向性を強調する。コードとデータはhttps://ineqmath.github.io/で利用可能である。

English

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.

大規模言語モデルを用いた不等式証明の解法

Solving Inequality Proofs with Large Language Models

要旨

Support