Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Abstract

Most agents today "self-evolve" by following rewards and rules defined by humans. This process remains fundamentally dependent on external supervision: without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability, so that they spontaneously learn about unseen environments before executing tasks. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during training, where it teaches the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions; it spontaneously performs native self-evolution, adapting to unknown environments using only its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.
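As a rough illustration of the outcome-based reward described in the abstract, the sketch below scores a piece of self-generated world knowledge by the change in downstream success rate it produces. The abstract does not give the exact formula, so the plain difference of success rates, the function names `success_rate` and `knowledge_reward`, and the sample outcomes are all assumptions made for illustration rather than the authors' implementation.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of downstream task attempts that succeeded."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


def knowledge_reward(with_knowledge: Sequence[bool],
                     without_knowledge: Sequence[bool]) -> float:
    """Hypothetical reward: success-rate gain attributed to world knowledge.

    Assumption: the reward is the simple difference between the success
    rate with the generated knowledge and the success rate without it.
    """
    return success_rate(with_knowledge) - success_rate(without_knowledge)


# Illustrative outcomes only, not results from the paper.
baseline = [True, False, False, False]  # 1 of 4 tasks solved without knowledge
assisted = [True, True, True, False]    # 3 of 4 tasks solved with knowledge
reward = knowledge_reward(assisted, baseline)
print(reward)
```

Under this assumed form, knowledge that helps receives a positive reward, knowledge that misleads the agent receives a negative one, and the signal is needed only while training, which matches the abstract's statement that no external reward is used at inference time.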


