

Training-Free Group Relative Policy Optimization

October 9, 2025
Authors: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
cs.AI

Abstract

Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages group relative semantic advantages instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. Such knowledge serves as the learned token prior and is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs while incurring only marginal training data and cost.
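The abstract describes the method only at a high level, so the sketch below is one plausible reading rather than the authors' implementation. It assumes a hypothetical `llm_call(prompt) -> str` wrapper around a frozen LLM API (e.g., DeepSeek-V3.1-Terminus) and a task-specific `is_correct(answer, gold)` checker; the "group relative semantic advantage" is realized here by asking the frozen model to contrast successful and failed rollouts within a group and distill a reusable lesson, which is then prepended to future prompts as the learned token prior.

```python
# Conceptual sketch of a Training-Free GRPO loop (an assumption-laden reading
# of the abstract, not the authors' code). `llm_call` and `is_correct` are
# hypothetical, user-supplied callables.
from typing import Callable, List, Tuple


def training_free_grpo(
    llm_call: Callable[[str], str],          # wraps a frozen LLM API
    is_correct: Callable[[str, str], bool],  # task-specific answer checker
    train_set: List[Tuple[str, str]],        # (problem, ground-truth answer) pairs
    epochs: int = 3,
    group_size: int = 4,
) -> List[str]:
    """Distill an experience library (learned token prior) with no parameter updates."""
    experiences: List[str] = []  # experiential knowledge injected into every prompt

    def prompt_with_prior(problem: str) -> str:
        prior = "\n".join(f"- {e}" for e in experiences) or "(none yet)"
        return f"Useful experience from past attempts:\n{prior}\n\nProblem:\n{problem}"

    for _ in range(epochs):
        for problem, gold in train_set:
            # 1) Sample a group of rollouts with the current token prior in context.
            group = [llm_call(prompt_with_prior(problem)) for _ in range(group_size)]
            rewards = [is_correct(r, gold) for r in group]

            # 2) Skip groups with no learning signal (all correct or all wrong),
            #    mirroring the zero-advantage case in numerical GRPO.
            if all(rewards) or not any(rewards):
                continue

            # 3) Group relative *semantic* advantage: instead of a scalar advantage,
            #    ask the frozen LLM to contrast better and worse rollouts and state
            #    what the better ones did differently, as a short reusable lesson.
            good = [r for r, ok in zip(group, rewards) if ok]
            bad = [r for r, ok in zip(group, rewards) if not ok]
            lesson = llm_call(
                "Compare the successful and failed attempts below and write one "
                "short, general lesson that would help solve similar problems.\n\n"
                f"Successful:\n{good[0]}\n\nFailed:\n{bad[0]}"
            )

            # 4) Update the experience library; this replaces the gradient step.
            experiences.append(lesson.strip())

    return experiences  # prepend these to prompts at inference time
```

At test time the frozen model is simply called with the distilled experience list prepended to the prompt; no weights are updated at any point, which is what makes the approach compatible with closed LLM APIs.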