
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

December 3, 2025
Authors: Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li
cs.AI

Abstract

Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation, making it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, in which declining likelihood produces low-confidence responses, which in turn inflate gradients and ultimately cause collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question-answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose LLDS, a lightweight likelihood-preserving regularization for GRPO that activates only when a trajectory's likelihood decreases and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TI-RL and provide a practical path toward stable, scalable training of tool-integrated LLMs.
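
The abstract describes LLDS only at a high level: a penalty that activates when a trajectory's likelihood decreases and touches only the tokens responsible. Below is a minimal PyTorch sketch of one plausible reading, assuming per-token log-probabilities under the current and rollout policies and treating "responsible tokens" as those whose log-probability dropped; the function name, its arguments, and the coefficient are illustrative assumptions, not the paper's implementation.

```python
import torch

def lld_regularizer(logp_new, logp_old, response_mask, coef=0.1):
    """Gated, token-level likelihood-preserving penalty (illustrative sketch).

    Args:
        logp_new: (B, T) per-token log-probs of sampled responses under the current policy.
        logp_old: (B, T) per-token log-probs under the policy that generated the rollouts
                  (assumed already detached from the graph).
        response_mask: (B, T) float mask, 1 for response tokens, 0 for prompt/tool-output tokens.
        coef: penalty weight (assumed hyperparameter).
    """
    # Trajectory-level log-likelihood under the new and old policies.
    seq_new = (logp_new * response_mask).sum(dim=-1)
    seq_old = (logp_old * response_mask).sum(dim=-1)

    # Gate: only trajectories whose likelihood decreased contribute a penalty.
    gate = (seq_new < seq_old).float()  # (B,)

    # "Responsible" tokens: assumed here to be tokens whose log-prob dropped
    # relative to the sampling policy; all other tokens contribute nothing.
    token_drop = torch.relu(logp_old - logp_new) * response_mask  # (B, T)

    # Average the per-token drops of gated trajectories and scale by coef.
    penalty = (gate.unsqueeze(-1) * token_drop).sum() / response_mask.sum().clamp(min=1.0)
    return coef * penalty


# Usage (schematic): add the penalty to the usual GRPO objective, e.g.
# loss = grpo_loss + lld_regularizer(logp_new, logp_old, response_mask)
```

The gating keeps the regularizer inactive on trajectories whose likelihood is already rising, and the per-token restriction avoids uniformly pushing up the whole sequence, which is consistent with the abstract's claim of "minimal interference to optimization"; the exact criterion for responsible tokens used in the paper may differ.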