
Solving Inequality Proofs with Large Language Models

June 9, 2025
Authors: Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, Pan Lu
cs.AI

Abstract

Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
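The bound-estimation subtask can be made concrete with a small numerical sketch. The helper below (`check_bound` is a hypothetical illustration, not part of the released IneqMath tooling) searches random positive samples for a counterexample to a candidate constant C in an inequality of the form f(a, b) >= C * g(a, b), using the AM-GM inequality a + b >= 2*sqrt(a*b) as the example:

```python
import math
import random

def check_bound(f, g, C, trials=10_000, seed=0):
    """Numerically test whether f(a, b) >= C * g(a, b) on random positive samples.

    A single counterexample disproves the candidate bound; surviving all
    trials only suggests (does not prove) that the bound holds.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(0.01, 100.0)
        b = rng.uniform(0.01, 100.0)
        # Small tolerance guards against floating-point noise at equality.
        if f(a, b) < C * g(a, b) - 1e-9:
            return False  # counterexample found
    return True

# AM-GM: a + b >= C * sqrt(a*b); the tight (largest valid) constant is C = 2.
f = lambda a, b: a + b
g = lambda a, b: math.sqrt(a * b)

check_bound(f, g, 2.0)  # holds on all sampled points
check_bound(f, g, 2.1)  # a counterexample appears quickly near a == b
```

A checker like this only falsifies loose bounds; the benchmark's point is that certifying the tight constant still requires a rigorous step-wise proof, which is what the step-wise judges evaluate.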
PDF (June 11, 2025)