Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
December 17, 2025
Authors: Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu
cs.AI
Abstract
Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward coefficient, while redundant or off-manifold updates are de-emphasized, yielding a self-referential exploration signal that aligns naturally with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLU-Pro) on Qwen3 base models (1.7B and 4B), G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in reinforcement learning for large language models.
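To make the mechanism concrete, the following is a minimal sketch of the group-wise reward scaling the abstract describes, assuming cosine similarity between pooled sequence-level features and a tanh-bounded coefficient centered at 1. The names (novelty_scale, seq_features, alpha) are illustrative assumptions, not the authors' implementation, and the actual G2RL feature construction and bounds may differ.

```python
# Hypothetical sketch of gradient-guided, group-wise reward scaling.
# Assumes each sampled response already has a pooled sequence-level feature
# (e.g. a summary of the model's final-layer sensitivity); those features and
# all names here are illustrative, not the paper's actual implementation.
import torch
import torch.nn.functional as F


def novelty_scale(seq_features: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Bounded multiplicative reward coefficients for one sampled group.

    seq_features: (G, D) tensor, one feature vector per sampled response.
    Returns a (G,) tensor of coefficients within (1 - alpha, 1 + alpha):
    trajectories whose feature direction differs from the rest of the group
    (novel update directions) are scaled up; redundant ones are de-emphasized.
    """
    f = F.normalize(seq_features, dim=-1)        # unit-norm feature directions
    sim = f @ f.T                                # (G, G) cosine similarities
    G = sim.size(0)
    # Mean similarity of each trajectory to the *other* group members
    # (the diagonal self-similarity of 1 is subtracted out).
    off_diag = (sim.sum(dim=-1) - 1.0) / max(G - 1, 1)
    novelty = 1.0 - off_diag                     # high when the direction is new
    # Center within the group and squash through tanh for a bounded coefficient.
    centered = novelty - novelty.mean()
    return 1.0 + alpha * torch.tanh(centered)


# Usage: rescale verifier rewards for a group of G = 4 sampled responses
# before the group-relative (GRPO-style) advantage computation.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])     # per-response verifier rewards
features = torch.randn(4, 64)                    # stand-in sequence-level features
scaled_rewards = rewards * novelty_scale(features)
```

Centering the novelty scores within the group keeps the average coefficient near 1, so in this sketch the scaling redistributes credit among the sampled responses rather than inflating the overall reward magnitude.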