Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Abstract

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting even agents built on top-tier LLMs (e.g., Gemini-2.5-Pro). Several emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation and the unintended introduction of vulnerabilities during tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution. Warning: this paper includes examples that may be offensive or harmful in nature.
