Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
April 20, 2026
Authors: Qifan Zhang, Dongyang Ma, Tianqing Fang, Jia Li, Jing Tang, Nuo Chen, Haitao Mi, Yan Wang
cs.AI
Abstract
Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution.
To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters.
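The outcome-based reward described above can be sketched as a simple difference in downstream success rates: run the same task set with and without the agent's self-generated world knowledge, and reward the knowledge by the improvement it yields. The following is a minimal illustrative sketch, not the paper's implementation; the function names, the toy task runner, and the string-based tasks are all hypothetical assumptions.

```python
def outcome_based_reward(tasks, run_task, knowledge):
    """Reward for self-generated world knowledge (illustrative sketch).

    Computes the downstream success rate with the knowledge injected,
    minus the baseline success rate without it. `run_task` is assumed
    to return True on task success, False on failure.
    """
    baseline = sum(run_task(t, knowledge=None) for t in tasks) / len(tasks)
    assisted = sum(run_task(t, knowledge=knowledge) for t in tasks) / len(tasks)
    return assisted - baseline


def toy_run_task(task, knowledge=None):
    """Hypothetical task runner: easy tasks always succeed; hard tasks
    succeed only when world knowledge about the environment is available."""
    return task == "easy" or knowledge is not None


# Toy example: knowledge lifts success from 1/4 to 4/4, so reward = 0.75.
tasks = ["easy", "hard", "hard", "hard"]
reward = outcome_based_reward(tasks, toy_run_task, knowledge="site map notes")
print(reward)  # 0.75
```

In training, this scalar would serve as the signal for optimizing the exploration-and-summarization policy; at inference time, per the abstract, no such reward is computed and the agent adapts using its learned parameters alone.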
When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.