FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
February 27, 2025
Authors: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong
cs.AI
Abstract
Many challenging reasoning tasks require not just rapid, intuitive responses,
but a more deliberate, multi-step approach. Recent progress in large language
models (LLMs) highlights an important shift from the quick, reactive "System 1"
mode to the reflective, self-correcting "System 2" style of problem solving.
However, current benchmarks rely heavily on final-answer accuracy, leaving a
model's intermediate reasoning steps largely unexamined. This fails to assess
the model's ability to reflect on and rectify mistakes within the reasoning
process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark
for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be
decomposed into atomic steps, making it ideal for rigorous validation of
intermediate correctness. Building on this, we introduce two tasks, state
checking and state transition, to comprehensively evaluate how models
assess the current situation and plan the next move. To support broader
research, we also provide a puzzle training set aimed at enhancing performance
on general mathematical tasks. We show that models trained on our state-checking
and state-transition data achieve gains in math reasoning of up to 5.1% on GSM8K.
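
To make the two tasks concrete, below is a minimal Python sketch using a 4x4 Sudoku: state checking asks whether an intermediate state can still reach a valid solution, and state transition asks for the legal next states. The grid size, the helper names (candidates, is_solvable, next_states), and the depth-first search are illustrative assumptions rather than the benchmark's actual implementation.

```python
# Minimal conceptual sketch of the two task formats on a 4x4 Sudoku.
# Names and the DFS strategy are illustrative, not FINEREASON's actual code.
from copy import deepcopy

SIZE, BOX = 4, 2  # 4x4 grid with 2x2 boxes; 0 marks an empty cell

def candidates(grid, r, c):
    """Values that can legally fill cell (r, c) in the current state."""
    used = set(grid[r]) | {grid[i][c] for i in range(SIZE)}
    br, bc = (r // BOX) * BOX, (c // BOX) * BOX
    used |= {grid[i][j] for i in range(br, br + BOX) for j in range(bc, bc + BOX)}
    return [v for v in range(1, SIZE + 1) if v not in used]

def first_empty(grid):
    """Locate the next unfilled cell, or None if the grid is complete."""
    return next(((r, c) for r in range(SIZE) for c in range(SIZE)
                 if grid[r][c] == 0), None)

def place(grid, r, c, v):
    """Return a copy of the grid with value v placed at (r, c)."""
    new = deepcopy(grid)
    new[r][c] = v
    return new

def is_solvable(grid):
    """State checking: can this intermediate state still reach a solution?"""
    cell = first_empty(grid)
    if cell is None:
        return True  # complete grid; every placement was legal by construction
    r, c = cell
    return any(is_solvable(place(grid, r, c, v)) for v in candidates(grid, r, c))

def next_states(grid):
    """State transition: legal successor states filling the next empty cell."""
    cell = first_empty(grid)
    if cell is None:
        return []
    r, c = cell
    return [place(grid, r, c, v) for v in candidates(grid, r, c)]

state = [[1, 0, 0, 0],
         [0, 0, 3, 0],
         [0, 4, 0, 0],
         [0, 0, 0, 2]]
print(is_solvable(state))       # state checking: True or False
print(len(next_states(state)))  # state transition: count of legal next moves
```

In this spirit, ground-truth labels for both tasks can be derived mechanically from the puzzle's rules, which is what makes every intermediate step rigorously verifiable.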