Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
May 7, 2026
Authors: Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang, Jiaxin Huang
cs.AI
Abstract
Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, on complex tasks GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. The model then receives no effective training signal for these questions, wasting both training data and compute budget. While simply increasing the sampling budget for such questions is a common remedy, a static sampling policy inherently constrains reasoning exploration, capping the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework that breaks this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Concretely, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments on 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
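For intuition, here is a minimal Python sketch of the two mechanisms the abstract describes: a GRPO-style group-relative advantage that collapses to zero when every rollout in a group receives the same reward, and LoPE's resampling under a random Lorem Ipsum prefix. The `generate` and `verify` callables, the vocabulary sample, and all parameters are hypothetical illustrations under stated assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical sketch of LoPE-style resampling; `generate(prompt, n)` and
# `verify(rollout)` stand in for any rollout engine and reward checker.

# A small sample of Lorem Ipsum vocabulary (pseudo-Latin placeholder words).
LOREM_VOCAB = [
    "lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
    "adipiscing", "elit", "sed", "eiusmod", "tempor", "incididunt",
]

def lorem_perturbation(num_words: int = 16) -> str:
    """Stochastically assemble a pseudo-Latin prefix from Lorem Ipsum words."""
    return " ".join(random.choices(LOREM_VOCAB, k=num_words))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each reward standardized within its group.

    If every rollout gets the same reward (e.g. all fail), the standard
    deviation is zero and every advantage collapses to zero -- the
    "zero-advantage problem" the abstract describes.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def lope_resample(prompt, generate, verify, group_size: int = 8):
    """Resample an all-fail query under a task-irrelevant perturbation."""
    rollouts = generate(prompt, n=group_size)
    rewards = [verify(r) for r in rollouts]
    if max(rewards) > 0.0:
        return rollouts, rewards  # at least one success: usable signal
    # Every rollout failed: prepend a random Lorem Ipsum sequence to
    # shift the output distribution, then resample the same question.
    perturbed = lorem_perturbation() + "\n" + prompt
    rollouts = generate(perturbed, n=group_size)
    return rollouts, [verify(r) for r in rollouts]
```

The design point the abstract emphasizes is that the prefix is task-irrelevant: it perturbs the prompt space without altering the question itself, which is what lets the perturbation unlock reasoning paths orthogonal to those sampled from the original prompt.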