RAGEN-2: Reasoning Collapse in Agentic RL
April 7, 2026
Authors: Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
cs.AI
Abstract
RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability, but it only measures diversity within the same input and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (entropy) and cross-input distinguishability (mutual information, MI), and introduce a family of mutual-information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise-ratio (SNR) mechanism: low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering, which selects high-signal prompts each iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
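The entropy/MI decomposition can be illustrated with a minimal sketch. Assume reasoning traces have been clustered into discrete template ids (a simplification not specified in the abstract); then within-input diversity is the conditional entropy H(T|X) and cross-input distinguishability is I(X;T) = H(T) - H(T|X). The data and helper names below are hypothetical:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in nats) of a count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def within_input_entropy(samples):
    """Mean per-input entropy of template ids: an estimate of H(T | X)."""
    ents = [entropy(Counter(ts)) for ts in samples.values()]
    return sum(ents) / len(ents)

def mutual_information(samples):
    """I(X; T) = H(T) - H(T | X), estimated from per-input template samples."""
    all_templates = Counter(t for ts in samples.values() for t in ts)
    n = sum(all_templates.values())
    h_t = entropy(all_templates)
    h_t_given_x = sum((len(ts) / n) * entropy(Counter(ts))
                      for ts in samples.values())
    return h_t - h_t_given_x

# Template collapse: templates look diverse within each input,
# but the distribution is identical across inputs (input-agnostic).
collapsed = {"x1": ["a", "b", "c"], "x2": ["a", "b", "c"], "x3": ["a", "b", "c"]}
# Input-responsive reasoning: each input gets its own template.
responsive = {"x1": ["a", "a", "a"], "x2": ["b", "b", "b"], "x3": ["c", "c", "c"]}

print(within_input_entropy(collapsed))   # high entropy: log(3)
print(mutual_information(collapsed))     # MI = 0: invisible to entropy
print(mutual_information(responsive))    # MI = log(3)
```

The collapsed case is exactly the failure mode the abstract describes: entropy stays high while MI drops to zero.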
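SNR-Aware Filtering, as described, keeps prompts whose rollout rewards vary, since zero-variance groups contribute no task gradient under group-relative advantage estimation. A minimal sketch, assuming a dict of per-prompt rollout rewards and a hypothetical `keep_fraction` parameter:

```python
import statistics

def snr_filter(prompt_rewards, keep_fraction=0.5):
    """Select the prompts with the highest group reward variance,
    used as a lightweight proxy for gradient signal.

    prompt_rewards: dict mapping prompt_id -> list of rewards from
    sampled rollouts of that prompt in the current iteration.
    """
    scored = sorted(prompt_rewards.items(),
                    key=lambda kv: statistics.pvariance(kv[1]),
                    reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return [pid for pid, _ in scored[:k]]

rewards = {
    "p1": [1, 1, 1, 1],  # solved every time: variance 0, no signal
    "p2": [0, 1, 0, 1],  # mixed outcomes: variance 0.25, high signal
    "p3": [0, 0, 0, 0],  # never solved: variance 0, no signal
}
print(snr_filter(rewards, keep_fraction=0.34))  # ['p2']
```

Prompts that are always solved or never solved both yield zero reward variance, so filtering on variance concentrates each iteration on the informative middle.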