Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
December 17, 2025
Authors: Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu
cs.AI
Abstract
Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaling factor, while redundant or off-manifold updates are de-emphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLU-Pro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
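To make the mechanism described in the abstract more concrete, the following is a minimal PyTorch sketch of the general idea: derive a sequence-level "update direction" feature for each sampled response from a forward pass, score its novelty against the other responses in the group, and map that novelty to a bounded multiplicative reward scale. The specific feature (softmax(logits) minus the one-hot sampled token, a forward-pass proxy for final-layer sensitivity), the cosine-based novelty measure, and all constants and function names (sequence_feature, novelty_scales, alpha, the clipping bounds) are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of gradient-guided exploration shaping, not the official G2RL code.
import torch
import torch.nn.functional as F


def sequence_feature(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Sequence-level feature from final-layer sensitivity (assumed proxy).

    logits: (T, V) pre-softmax outputs for one sampled response
    tokens: (T,)  sampled token ids
    Uses softmax(logits) - onehot(tokens), i.e. the per-token gradient of the
    token log-likelihood w.r.t. the logits, which requires no backward pass.
    """
    probs = F.softmax(logits, dim=-1)                      # (T, V)
    onehot = F.one_hot(tokens, logits.size(-1)).float()    # (T, V)
    per_token = probs - onehot                             # (T, V)
    feat = per_token.mean(dim=0)                           # pool over tokens -> (V,)
    return F.normalize(feat, dim=0)


def novelty_scales(features: torch.Tensor,
                   alpha: float = 0.5,
                   low: float = 0.5,
                   high: float = 1.5) -> torch.Tensor:
    """Bounded multiplicative reward scales from within-group direction novelty.

    features: (G, V) normalized sequence features for a group of G responses.
    Novelty of response i = 1 - mean cosine similarity to the other responses;
    redundant directions get scales < 1, novel ones > 1, clipped to [low, high]
    so the shaping stays compatible with PPO-style clipping and KL control.
    """
    sim = features @ features.T                            # (G, G) cosine similarities
    g = features.size(0)
    mean_sim = (sim.sum(dim=1) - sim.diag()) / max(g - 1, 1)
    novelty = 1.0 - mean_sim                               # higher = more novel direction
    return torch.clamp(1.0 + alpha * (novelty - novelty.mean()), low, high)


# Toy usage: a group of 4 responses, each 8 tokens over a 16-word vocabulary.
if __name__ == "__main__":
    torch.manual_seed(0)
    group_logits = torch.randn(4, 8, 16)
    group_tokens = torch.randint(0, 16, (4, 8))
    feats = torch.stack([sequence_feature(l, t)
                         for l, t in zip(group_logits, group_tokens)])
    scales = novelty_scales(feats)
    rewards = torch.tensor([1.0, 1.0, 0.0, 1.0])           # e.g. verifier rewards
    shaped = rewards * scales                              # applied before the group advantage step
    print(scales, shaped)
```

In this sketch the scaled rewards would feed into a GRPO-style group-normalized advantage; where exactly the scaling is applied, and how the bound is chosen, are design choices the paper itself specifies and this example only approximates.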