Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Abstract

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.
