Training-Free Group Relative Policy Optimization
October 9, 2025
Authors: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
cs.AI
Abstract
Recent advances in Large Language Model (LLM) agents have demonstrated their
promising general capabilities. However, their performance in specialized
real-world domains often degrades due to challenges in effectively integrating
external tools and specific prompting strategies. While methods like agentic
reinforcement learning have been proposed to address this, they typically rely
on costly parameter updates, for example, Supervised Fine-Tuning (SFT) followed
by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization
(GRPO) that alters the output distribution. We argue instead that LLMs can
achieve a similar effect on the
output distribution by learning experiential knowledge as a token prior, which
is a far more lightweight approach that not only addresses practical data
scarcity but also avoids the common issue of overfitting. To this end, we
propose Training-Free Group Relative Policy Optimization (Training-Free GRPO),
a cost-effective solution that enhances LLM agent performance without any
parameter updates. Our method leverages a group-relative semantic advantage,
rather than a numerical one, within each group of rollouts, iteratively
distilling high-quality experiential knowledge through multi-epoch learning on
minimal ground-truth data. This knowledge serves as a learned token prior that
is seamlessly integrated into LLM API calls to guide model behavior. Experiments
on mathematical reasoning and web search tasks demonstrate that
Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly
improves out-of-domain performance. With just a few dozen training samples and
marginal cost, Training-Free GRPO outperforms fine-tuned small LLMs.
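
For context, standard GRPO assigns each rollout a numerical group-relative advantage by normalizing its reward against the statistics of its group; Training-Free GRPO replaces this scalar with a natural-language (semantic) counterpart. The formula below restates the standard GRPO advantage for reference; it is not given in this abstract.

```latex
% Numerical group-relative advantage in standard GRPO, for a group of G
% rollouts with scalar rewards r_1, ..., r_G (restated here for context):
\[
  \hat{A}_i = \frac{r_i - \operatorname{mean}\!\left(\{r_1, \dots, r_G\}\right)}
                   {\operatorname{std}\!\left(\{r_1, \dots, r_G\}\right)},
  \qquad i = 1, \dots, G .
\]
```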
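A minimal sketch of how such a training-free loop might look is given below, assuming a generic `chat(prompt) -> str` client and a `reward(rollout, answer)` scorer; these names, the prompt wording, and the `ExperienceLibrary` structure are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a Training-Free GRPO-style loop: no parameter updates, only an
# evolving natural-language experience library used as a token prior.
from dataclasses import dataclass, field


@dataclass
class ExperienceLibrary:
    """Experiential knowledge distilled from rollouts (hypothetical structure)."""
    items: list[str] = field(default_factory=list)

    def as_prompt(self) -> str:
        return "\n".join(f"- {e}" for e in self.items)


def training_free_grpo(problems, chat, reward, epochs=3, group_size=4):
    library = ExperienceLibrary()
    for _ in range(epochs):                      # multi-epoch learning
        for question, answer in problems:        # a few dozen samples suffice
            prior = library.as_prompt()
            # 1) Sample a group of rollouts conditioned on the current prior.
            rollouts = [chat(f"Known experience:\n{prior}\n\nSolve: {question}")
                        for _ in range(group_size)]
            scores = [reward(r, answer) for r in rollouts]
            if len(set(scores)) <= 1:
                continue                         # no contrast within the group
            # 2) Ask the LLM for a *semantic* advantage: what distinguishes
            #    high-reward rollouts from low-reward ones, in natural language.
            critique = chat(
                "Compare these solution attempts and their scores, then state "
                "one reusable lesson for future problems:\n\n"
                + "\n\n".join(f"[score={s}]\n{r}" for r, s in zip(rollouts, scores))
            )
            # 3) Distill the lesson into the experience library (token prior);
            #    model parameters are never touched.
            library.items.append(critique.strip())
    return library
```

At inference time, the returned library would simply be prepended to the prompt of any OpenAI-compatible API call, which is how the learned token prior guides behavior without fine-tuning.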