Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Abstract

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks that current safety research has overlooked. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as misevolution. To investigate it systematically, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting even agents built on top-tier LLMs (e.g., Gemini-2.5-Pro). We observe several emergent risks in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation and the unintended introduction of vulnerabilities during tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution. Warning: this paper includes examples that may be offensive or harmful in nature.
