SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers

Abstract

Large Language Models (LLMs) have achieved human-level proficiency across diverse tasks, but their ability to solve mathematical problems rigorously remains an open challenge. In this work, we investigate a fundamental yet computationally intractable problem: determining whether a given multivariate polynomial is nonnegative. This problem is closely related to Hilbert's Seventeenth Problem, plays a crucial role in global polynomial optimization, and has applications in various fields. First, we introduce SoS-1K, a meticulously curated dataset of approximately 1,000 polynomials, along with expert-designed reasoning instructions based on five progressively challenging criteria. Evaluating multiple state-of-the-art LLMs, we find that without structured guidance, all models perform only slightly above the 50% random-guess baseline. High-quality reasoning instructions, however, significantly improve accuracy, raising it to as much as 81%. Furthermore, our 7B model, SoS-7B, fine-tuned on SoS-1K for just 4 hours, outperforms the 671B DeepSeek-V3 and GPT-4o-mini in accuracy while requiring only 1.8% and 5%, respectively, of the computation time those models need. Our findings highlight the potential of LLMs to push the boundaries of mathematical reasoning and tackle NP-hard problems.
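To make the nonnegativity question concrete, here is a minimal sketch of what a sum-of-squares certificate looks like. The polynomial, the function names, and the grid are illustrative assumptions and are not taken from SoS-1K or from the authors' method. If a polynomial can be written as a sum of squared polynomials, it is nonnegative everywhere; the sketch checks one such candidate identity on integer points.

```python
from itertools import product


def p(x, y):
    # Hypothetical example polynomial (not from SoS-1K): x^2 - 2xy + 2y^2
    return x * x - 2 * x * y + 2 * y * y


def sos_certificate(x, y):
    # Candidate sum-of-squares form of p: (x - y)^2 + y^2
    return (x - y) ** 2 + y ** 2


def certificate_matches(points):
    # True when p and the candidate certificate agree at every sample point
    return all(p(x, y) == sos_certificate(x, y) for x, y in points)


# 7 x 7 integer grid; integer arithmetic keeps the comparison exact
grid = list(product(range(-3, 4), repeat=2))
matches = certificate_matches(grid)
min_value = min(p(x, y) for x, y in grid)
```

Because both expressions have degree at most 2 in each variable, agreement on this grid already implies that they are the same polynomial, so p is nonnegative. The converse does not hold in general: some nonnegative polynomials, such as the Motzkin polynomial, admit no sum-of-squares decomposition, which is part of what makes the general problem hard.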
