Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Abstract

As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
