FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
February 27, 2025
Authors: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong
cs.AI
Abstract
Many challenging reasoning tasks require not just rapid, intuitive responses,
but a more deliberate, multi-step approach. Recent progress in large language
models (LLMs) highlights an important shift from the quick, reactive "System 1"
mode to the reflective, self-correcting "System 2" style of problem solving.
However, current benchmarks rely heavily on final-answer accuracy, leaving a
model's intermediate reasoning steps largely unexamined. This fails to assess
the model's ability to reflect on and rectify mistakes within the reasoning
process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark
for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be
decomposed into atomic steps, making it ideal for rigorous validation of
intermediate correctness. Building on this, we introduce two tasks, state
checking and state transition, to comprehensively evaluate how models
assess the current situation and plan the next move. To support broader
research, we also provide a puzzle training set aimed at enhancing performance
on general mathematical tasks. We show that models trained on our state-checking
and state-transition data achieve gains in math reasoning of up to 5.1% on GSM8K.
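
To make the two tasks concrete, below is a minimal Python sketch using a 4x4 Sudoku: state checking asks whether an intermediate state can still reach a valid solution, and state transition asks for the legal next states. The grid size, the helper names (candidates, is_solvable, next_states), and the depth-first search are illustrative assumptions rather than the benchmark's actual implementation.

```python
# Minimal conceptual sketch of the two task formats on a 4x4 Sudoku.
# Names and the DFS strategy are illustrative, not FINEREASON's actual code.
from copy import deepcopy

SIZE, BOX = 4, 2  # 4x4 grid with 2x2 boxes; 0 marks an empty cell

def candidates(grid, r, c):
    """Values that can legally fill cell (r, c) in the current state."""
    used = set(grid[r]) | {grid[i][c] for i in range(SIZE)}
    br, bc = (r // BOX) * BOX, (c // BOX) * BOX
    used |= {grid[i][j] for i in range(br, br + BOX) for j in range(bc, bc + BOX)}
    return [v for v in range(1, SIZE + 1) if v not in used]

def first_empty(grid):
    """Locate the next unfilled cell, or None if the grid is complete."""
    return next(((r, c) for r in range(SIZE) for c in range(SIZE)
                 if grid[r][c] == 0), None)

def place(grid, r, c, v):
    """Return a copy of the grid with value v placed at (r, c)."""
    new = deepcopy(grid)
    new[r][c] = v
    return new

def is_solvable(grid):
    """State checking: can this intermediate state still reach a solution?"""
    cell = first_empty(grid)
    if cell is None:
        return True  # complete grid; every placement was legal by construction
    r, c = cell
    return any(is_solvable(place(grid, r, c, v)) for v in candidates(grid, r, c))

def next_states(grid):
    """State transition: legal successor states filling the next empty cell."""
    cell = first_empty(grid)
    if cell is None:
        return []
    r, c = cell
    return [place(grid, r, c, v) for v in candidates(grid, r, c)]

state = [[1, 0, 0, 0],
         [0, 0, 3, 0],
         [0, 4, 0, 0],
         [0, 0, 0, 2]]
print(is_solvable(state))       # state checking: True or False
print(len(next_states(state)))  # state transition: count of legal next moves
```

In this spirit, ground-truth labels for both tasks can be derived mechanically from the puzzle's rules, which is what makes every intermediate step rigorously verifiable.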