
Solving Inequality Proofs with Large Language Models

June 9, 2025
Authors: Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, Pan Lu
cs.AI

Abstract

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
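To make the two subtasks concrete, here is an illustrative instance based on the AM–GM inequality. This example is a sketch of the task format only, not taken from the IneqMath dataset; the exact phrasing and answer formats used by the benchmark may differ.

```latex
% Bound estimation: find the tightest constant for which the bound holds.
\textbf{Bound estimation.}\quad
\text{For all } a, b > 0,\ \text{find the largest constant } C
\text{ such that } a + b \ge C\sqrt{ab}.
% By AM--GM, a + b \ge 2\sqrt{ab}, with equality iff a = b, so C = 2.

% Relation prediction: decide which relation always holds.
\textbf{Relation prediction.}\quad
\text{For all } a, b > 0,\ \text{determine which of }
>,\ \ge,\ =,\ \le,\ < \text{ relates } a + b \text{ and } 2\sqrt{ab}.
% Answer: \ge (equality iff a = b).
```

In both cases the final answer is a single constant or relation symbol, which is what makes the informal formulation automatically checkable while the supporting proof remains free-form.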
PDF · June 11, 2025