Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Abstract

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with their environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks that current safety research has overlooked. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes; we refer to this as misevolution. To investigate it systematically, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting even agents built on top-tier LLMs (e.g., Gemini-2.5-Pro). We observe several emergent risks in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation and the unintended introduction of vulnerabilities during tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution. Warning: this paper includes examples that may be offensive or harmful in nature.