Training-Free Group Relative Policy Optimization

Abstract

Recent advances in Large Language Model (LLM) agents have demonstrated promising general capabilities. However, their performance in specialized real-world domains often degrades owing to the difficulty of effectively integrating external tools and specific prompting strategies. Methods such as agentic reinforcement learning have been proposed to address this, but they typically rely on costly parameter updates, for example Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. We argue instead that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, a far more lightweight approach that both addresses practical data scarcity and avoids the common problem of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Within each group of rollouts, our method uses a group-relative semantic advantage in place of a numerical one, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. This knowledge serves as the learned token prior and is integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks show that Training-Free GRPO, applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs while requiring only marginal training data and cost.
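
The loop described in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the names `build_prompt` and `training_free_grpo`, the prompt wording, the default group size and epoch count, and the choice to skip groups whose rewards are all equal. The `llm` argument is a hypothetical stand-in for whatever API call is used; no model weights are touched at any point.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def build_prompt(question: str, experiences: List[str]) -> str:
    """Prepend the experience library to the query (the 'token prior' role)."""
    if not experiences:
        return question
    notes = "\n".join(f"- {e}" for e in experiences)
    return f"Lessons from earlier attempts:\n{notes}\n\nQuestion: {question}"


def training_free_grpo(
    llm: LLM,
    dataset: List[Tuple[str, str]],          # (question, ground-truth answer) pairs
    reward_fn: Callable[[str, str], float],  # scores a rollout against the answer
    group_size: int = 4,                     # assumed value, not from the source
    epochs: int = 2,                         # assumed value, not from the source
) -> List[str]:
    """Return a list of natural-language lessons distilled from rollout groups."""
    experiences: List[str] = []
    for _ in range(epochs):
        for question, answer in dataset:
            prompt = build_prompt(question, experiences)
            # Sample a group of rollouts for the same query.
            rollouts = [llm(prompt) for _ in range(group_size)]
            rewards = [reward_fn(r, answer) for r in rollouts]
            # Assumption: a group with identical rewards gives no relative signal.
            if len(set(rewards)) < 2:
                continue
            ranked = sorted(zip(rewards, rollouts), key=lambda p: p[0], reverse=True)
            # Instead of a numeric advantage, ask the model to explain the gap in words.
            critique = (
                "Compare the attempts below and state one short, reusable lesson "
                "explaining why the better ones succeeded.\n"
                f"Question: {question}\n"
                + "\n".join(f"[reward={r}] {o}" for r, o in ranked)
            )
            lesson = llm(critique).strip()
            if lesson and lesson not in experiences:
                experiences.append(lesson)
    return experiences
```

At inference time, under the same assumptions, the returned list would be passed to `build_prompt` together with a new question, so the only thing that changes between "before" and "after" learning is the text sent to the model.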