RAGEN-2: Reasoning Collapse in Agentic Reinforcement Learning

Abstract

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
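The filtering step described above can be illustrated with a minimal sketch. The abstract states only that reward variance serves as a lightweight proxy for selecting high-signal prompts in each iteration; the function name `filter_prompts_by_reward_variance`, the `keep_fraction` parameter, and the top-fraction selection rule below are assumptions made for illustration and may differ from the authors' implementation.

```python
import math
from statistics import pvariance


def filter_prompts_by_reward_variance(rewards_by_prompt, keep_fraction=0.5):
    """Return the prompt ids whose rollout rewards vary the most.

    rewards_by_prompt: dict mapping a prompt id to the list of scalar
        rewards obtained from several rollouts of that prompt.
    keep_fraction: hypothetical knob; fraction of prompts to keep.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Per-prompt reward variance, used here as a stand-in for gradient signal.
    # Prompts with no rollouts are skipped.
    variances = {
        prompt_id: pvariance(rewards)
        for prompt_id, rewards in rewards_by_prompt.items()
        if len(rewards) > 0
    }

    # Rank prompts from highest to lowest variance and keep the top fraction.
    ranked = sorted(variances, key=variances.get, reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]


# Usage: prompts "a" and "c" always get the same reward, so they carry no
# variance-based signal and are filtered out first.
rewards = {
    "a": [0, 0, 0, 0],
    "b": [0, 1, 0, 1],
    "c": [1, 1, 1, 1],
    "d": [0, 0, 0, 1],
}
kept = filter_prompts_by_reward_variance(rewards, keep_fraction=0.5)
```

In this sketch, a prompt whose rollouts all succeed or all fail has zero reward variance and is ranked last, which matches the abstract's intuition that low reward variance yields weak task gradients.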