Large Language Models and Mathematical Reasoning Failures

Abstract

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models (including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants), we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
