SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
February 27, 2025
Authors: Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have achieved human-level proficiency across
diverse tasks, but their ability to perform rigorous mathematical problem
solving remains an open challenge. In this work, we investigate a fundamental
yet computationally intractable problem: determining whether a given
multivariate polynomial is nonnegative. This problem, closely related to
Hilbert's Seventeenth Problem, plays a crucial role in global polynomial
optimization and has applications in various fields. First, we introduce
SoS-1K, a meticulously curated dataset of approximately 1,000 polynomials,
along with expert-designed reasoning instructions based on five progressively
challenging criteria. Evaluating multiple state-of-the-art LLMs, we find that
without structured guidance, all models perform only slightly above the random-guess baseline of 50%. However, high-quality reasoning instructions significantly
improve accuracy, boosting performance up to 81%. Furthermore, our 7B model,
SoS-7B, fine-tuned on SoS-1K for just 4 hours, outperforms the 671B DeepSeek-V3
and GPT-4o-mini in accuracy, while requiring only 1.8% and 5% of the computation
time needed by the latter models, respectively. Our findings highlight the potential of
LLMs to push the boundaries of mathematical reasoning and tackle NP-hard
problems.
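To make the decision problem concrete: a standard sufficient certificate for the nonnegativity of a polynomial is a sum-of-squares (SoS) decomposition, which reduces to a small semidefinite feasibility program. The following is a minimal sketch, assuming the `cvxpy` package is available; the polynomial and monomial basis are illustrative choices for exposition, not taken from the paper's SoS-1K dataset or its solver.

```python
import cvxpy as cp
import numpy as np

# Monomial basis z = [x^2, x*y, y^2]; we seek a PSD Gram matrix Q with
# p(x, y) = z^T Q z, which certifies that p is a sum of squares (hence nonnegative).
Q = cp.Variable((3, 3), symmetric=True)

# Coefficient-matching constraints for the illustrative polynomial
# p(x, y) = 2x^4 + 2x^3 y - x^2 y^2 + 5y^4.
constraints = [
    Q >> 0,                       # Gram matrix must be positive semidefinite
    Q[0, 0] == 2,                 # x^4 coefficient
    2 * Q[0, 1] == 2,             # x^3 y coefficient
    2 * Q[0, 2] + Q[1, 1] == -1,  # x^2 y^2 coefficient
    2 * Q[1, 2] == 0,             # x y^3 coefficient
    Q[2, 2] == 5,                 # y^4 coefficient
]

# Feasibility problem: any solution yields an SoS certificate.
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()

if prob.status == cp.OPTIMAL:
    print("SoS certificate found; p is nonnegative.")
    print(np.round(Q.value, 3))
else:
    print("No SoS decomposition exists for this monomial basis.")
```

Note that an SoS decomposition is sufficient but not necessary for nonnegativity (e.g., the Motzkin polynomial is nonnegative yet not SoS), which is part of what makes the general decision problem studied in the paper hard.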