Training-Free Group Relative Policy Optimization
October 9, 2025
Authors: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
cs.AI
Abstract
Recent advances in Large Language Model (LLM) agents have demonstrated their
promising general capabilities. However, their performance in specialized
real-world domains often degrades due to challenges in effectively integrating
external tools and specific prompting strategies. While methods like agentic
reinforcement learning have been proposed to address this, they typically rely
on costly parameter updates, for example, through a process that uses
Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase
with Group Relative Policy Optimization (GRPO) to alter the output
distribution. In contrast, we argue that LLMs can achieve a similar effect on the
output distribution by learning experiential knowledge as a token prior, which
is a far more lightweight approach that not only addresses practical data
scarcity but also avoids the common issue of overfitting. To this end, we
propose Training-Free Group Relative Policy Optimization (Training-Free GRPO),
a cost-effective solution that enhances LLM agent performance without any
parameter updates. Our method leverages group relative semantic advantages
instead of numerical ones within each group of rollouts, iteratively distilling
high-quality experiential knowledge through multi-epoch learning on minimal
ground-truth data. Such knowledge serves as the learned token prior, which is
seamlessly integrated during LLM API calls to guide model behavior. Experiments
on mathematical reasoning and web searching tasks demonstrate that
Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly
improves out-of-domain performance. With just a few dozen training samples,
Training-Free GRPO outperforms fine-tuned small LLMs at only marginal training
data and cost.
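
For intuition, below is a minimal Python sketch of the learning loop the abstract describes. The prompts, the `llm` callable, and all helper names are illustrative assumptions, not the authors' released implementation.

```python
"""Minimal sketch of a Training-Free GRPO-style loop, under the assumptions
stated above: no parameter updates, only an experience library ("token prior")
distilled from group-relative semantic comparisons of rollouts."""

from typing import Callable, List

# Placeholder for any chat-completion style API client (an assumption).
LLMCall = Callable[[str], str]


def training_free_grpo(
    llm: LLMCall,
    problems: List[str],      # a few dozen ground-truth training samples
    references: List[str],    # reference answers used to judge rollouts
    group_size: int = 4,
    epochs: int = 3,
) -> List[str]:
    """Distill experiential knowledge without updating model parameters."""
    experiences: List[str] = []  # the experience library injected as a prior

    for _ in range(epochs):
        for problem, reference in zip(problems, references):
            prior = "\n".join(f"- {e}" for e in experiences)

            # 1) Sample a group of rollouts conditioned on the current prior.
            rollouts = [
                llm(f"Useful experience:\n{prior}\n\nSolve:\n{problem}")
                for _ in range(group_size)
            ]

            # 2) Ask the LLM for a *semantic* group-relative advantage:
            #    contrast better vs. worse rollouts against the reference
            #    and summarize what made the good ones good.
            critique = llm(
                "Compare these solutions to the reference answer and state, "
                "in one sentence, a reusable lesson that distinguishes the "
                f"better ones.\nReference: {reference}\nSolutions:\n"
                + "\n---\n".join(rollouts)
            )

            # 3) Fold the distilled lesson back into the experience library.
            experiences.append(critique.strip())

    return experiences
```

At inference time, the distilled experiences would simply be prepended to each prompt as the learned token prior during LLM API calls, leaving the underlying model weights untouched.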