Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Abstract

A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities: the agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, which leads to partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks the candidates to select one, solves the task conditioned on the selected skill, and distills a new skill from the resulting trajectory. All learning derives from a single task-outcome signal: its low-frequency trend credits selection, and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
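
The abstract does not say how the low-frequency trend and high-frequency variation of the task-outcome signal are computed. As a rough illustration only, the sketch below assumes an exponential moving average (EMA) as the trend and the per-episode residual as the variation. The function name `split_outcome_signal` and the smoothing parameter `alpha` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def split_outcome_signal(
    outcomes: List[float], alpha: float = 0.5
) -> Tuple[List[float], List[float]]:
    """Split a sequence of task outcomes into a slow trend and a fast residual.

    Assumption: an EMA stands in for the "low-frequency trend" and the
    residual (outcome minus EMA) for the "high-frequency variation".
    The paper's actual decomposition may differ.
    """
    trend: List[float] = []
    residual: List[float] = []
    # Seed the EMA with the first outcome so the first residual is zero.
    ema = outcomes[0] if outcomes else 0.0
    for r in outcomes:
        ema = alpha * r + (1.0 - alpha) * ema
        trend.append(ema)          # hypothetical credit signal for selection
        residual.append(r - ema)   # hypothetical credit signal for distillation
    return trend, residual


if __name__ == "__main__":
    # Binary success/failure outcomes over four episodes that used one skill.
    trend, residual = split_outcome_signal([1.0, 0.0, 1.0, 1.0])
    print(trend)
    print(residual)
```

Under this assumption, the trend and residual always sum back to the raw outcome, so both credit signals derive from the same single task-outcome signal, in the spirit of what the abstract describes.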
