

Large Language Models and Mathematical Reasoning Failures

February 17, 2025
Authors: Johan Boye, Birger Moell
cs.AI

Abstract

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models, including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants, we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
