Large Language Models and Mathematical Reasoning Failures

Abstract

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models (including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants), we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that, despite possessing broad mathematical knowledge, models struggle with problems requiring multi-step deduction or real-world knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities and the need for targeted improvements in structured reasoning and constraint handling.
