Training-Free Group Relative Policy Optimization

Abstract

Recent advances in Large Language Model (LLM) agents have demonstrated promising general capabilities. However, their performance in specialized real-world domains often degrades because they struggle to integrate external tools and specific prompting strategies effectively. Methods such as agentic reinforcement learning have been proposed to address this, but they typically rely on costly parameter updates: for example, Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. We argue instead that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior. This far more lightweight approach not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Within each group of rollouts, our method uses a group relative semantic advantage in place of a numerical one, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. This knowledge serves as the learned token prior and is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs while requiring only marginal training data and cost.
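The loop outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `llm`, `reward_fn`, and `summarize`, the helper `build_prompt`, the prompt wording, and the choice to skip groups whose rewards are all equal are hypothetical. `group_relative_advantages` is included only for contrast; it shows the numerical advantage commonly used in vanilla GRPO (reward minus group mean, divided by group standard deviation), here with the population standard deviation as an arbitrary choice.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Numerical advantage as in vanilla GRPO: (r - mean) / std within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    if sigma == 0:
        # All rollouts scored the same, so there is no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def build_prompt(question: str, experiences: Sequence[str]) -> str:
    """Hypothetical helper: prepend distilled lessons to the query as context."""
    if not experiences:
        return question
    notes = "\n".join(f"- {e}" for e in experiences)
    return f"Lessons from earlier attempts:\n{notes}\n\nQuestion: {question}"


def training_free_grpo(
    llm: Callable[[str], str],                      # assumed: prompt -> completion
    reward_fn: Callable[[str, str], float],         # assumed: (output, answer) -> score
    summarize: Callable[[str, List[str], List[float]], str],  # assumed: LLM-written lesson
    dataset: Sequence[Tuple[str, str]],             # (question, ground-truth answer)
    group_size: int = 4,
    epochs: int = 2,
) -> List[str]:
    """Collect natural-language lessons instead of updating model parameters."""
    experiences: List[str] = []
    for _ in range(epochs):
        for question, answer in dataset:
            prompt = build_prompt(question, experiences)
            rollouts = [llm(prompt) for _ in range(group_size)]
            rewards = [reward_fn(out, answer) for out in rollouts]
            # Assumption: a group with identical rewards offers nothing to compare.
            if len(set(rewards)) < 2:
                continue
            # The "semantic advantage": a textual comparison of better vs. worse rollouts.
            lesson = summarize(question, rollouts, rewards)
            if lesson and lesson not in experiences:
                experiences.append(lesson)
    return experiences
```

At inference time, the returned list would be passed through `build_prompt` for each new query, so the lessons condition the frozen model's output the way a learned prior would. In practice `summarize` would itself be an LLM call that reads the group's rollouts and rewards; here it is left as a plain callable so the control flow is easy to follow.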