

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

March 3, 2025
Authors: Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, Noah D. Goodman
cs.AI

Abstract

Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors -- verification, backtracking, subgoal setting, and backward chaining -- that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor -- models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
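
The RL setup described in the abstract hinges on a verifiable task: in the Countdown game, the model must combine a given set of numbers with basic arithmetic to hit a target, so any proposed expression can be checked mechanically and converted into a binary reward. The sketch below is only an illustration of such a verifier under that assumption; the function name `countdown_reward` and its signature are hypothetical and not taken from the paper.

```python
# Minimal sketch of a verifiable reward for the Countdown game (illustrative only;
# not the authors' code). Solutions are assumed to be plain arithmetic expressions
# over the provided numbers.
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Evaluate an arithmetic AST restricted to numbers and + - * /."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("disallowed expression")

def _literals(node):
    """Collect the numeric literals appearing in the expression."""
    if isinstance(node, ast.Constant):
        return [node.value]
    if isinstance(node, ast.BinOp):
        return _literals(node.left) + _literals(node.right)
    raise ValueError("disallowed expression")

def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    """Return 1.0 iff `expression` reaches `target` using each given number at most once."""
    try:
        tree = ast.parse(expression, mode="eval").body
        used, pool = _literals(tree), list(numbers)
        numbers_ok = all(used.count(x) <= pool.count(x) for x in set(used))
        return 1.0 if numbers_ok and abs(_eval(tree) - target) < 1e-6 else 0.0
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0

# Example: reach 24 from the numbers {4, 6, 12}.
print(countdown_reward("(6 - 4) * 12", [4, 6, 12], 24))  # 1.0
print(countdown_reward("6 * 4", [4, 6, 12], 24))         # 1.0 (also valid)
print(countdown_reward("12 + 13", [4, 6, 12], 24))       # 0.0 (13 not available)
```

In an RL loop, a reward of this kind scores each sampled solution, while the paper's behavioral analysis separately asks whether the generated trace exhibits verification, backtracking, subgoal setting, or backward chaining.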

