SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
February 27, 2025
Authors: Kechen Li, Wenqi Zhu, Coralia Cartis, Tianbo Ji, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have achieved human-level proficiency across
diverse tasks, but their ability to perform rigorous mathematical problem
solving remains an open challenge. In this work, we investigate a fundamental
yet computationally intractable problem: determining whether a given
multivariate polynomial is nonnegative. This problem, closely related to
Hilbert's Seventeenth Problem, plays a crucial role in global polynomial
optimization and has applications in various fields. First, we introduce
SoS-1K, a meticulously curated dataset of approximately 1,000 polynomials,
along with expert-designed reasoning instructions based on five progressively
challenging criteria. Evaluating multiple state-of-the-art LLMs, we find that
without structured guidance, all models perform only slightly above the random-guess baseline of 50%. However, high-quality reasoning instructions significantly
improve accuracy, boosting performance up to 81%. Furthermore, our 7B model,
SoS-7B, fine-tuned on SoS-1K for just 4 hours, outperforms the 671B DeepSeek-V3
and GPT-4o-mini in accuracy, while requiring only 1.8% and 5% of the computation
time needed by the latter models, respectively. Our findings highlight the potential of
LLMs to push the boundaries of mathematical reasoning and tackle NP-hard
problems.
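To make the decision problem concrete: a standard sufficient certificate for the nonnegativity of a polynomial is a sum-of-squares (SoS) decomposition, which reduces to a small semidefinite feasibility program. The following is a minimal sketch, assuming the `cvxpy` package is available; the polynomial and monomial basis are illustrative choices for exposition, not taken from the paper's SoS-1K dataset or its solver.

```python
import cvxpy as cp
import numpy as np

# Monomial basis z = [x^2, x*y, y^2]; we seek a PSD Gram matrix Q with
# p(x, y) = z^T Q z, which certifies that p is a sum of squares (hence nonnegative).
Q = cp.Variable((3, 3), symmetric=True)

# Coefficient-matching constraints for the illustrative polynomial
# p(x, y) = 2x^4 + 2x^3 y - x^2 y^2 + 5y^4.
constraints = [
    Q >> 0,                       # Gram matrix must be positive semidefinite
    Q[0, 0] == 2,                 # x^4 coefficient
    2 * Q[0, 1] == 2,             # x^3 y coefficient
    2 * Q[0, 2] + Q[1, 1] == -1,  # x^2 y^2 coefficient
    2 * Q[1, 2] == 0,             # x y^3 coefficient
    Q[2, 2] == 5,                 # y^4 coefficient
]

# Feasibility problem: any solution yields an SoS certificate.
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()

if prob.status == cp.OPTIMAL:
    print("SoS certificate found; p is nonnegative.")
    print(np.round(Q.value, 3))
else:
    print("No SoS decomposition exists for this monomial basis.")
```

Note that an SoS decomposition is sufficient but not necessary for nonnegativity (e.g., the Motzkin polynomial is nonnegative yet not SoS), which is part of what makes the general decision problem studied in the paper hard.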