Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
October 6, 2025
Authors: Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao
cs.AI
Abstract
As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
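To build intuition for the two paradigms named in the abstract, the sketch below is a minimal toy simulation, not the paper's LLM-agent testbed: every constant, function name, and reward value is an illustrative assumption. It models Self-Interested Exploration as per-agent value updates that favor a higher-paying rule violation, and Imitative Strategy Diffusion as agents occasionally copying the policy of a better-earning peer, then tracks how the population's probability of acting aligned decays over deployment-time rounds.

```python
import math
import random

# Hypothetical parameters -- illustrative only, not the paper's testbed values.
N_AGENTS = 20
N_ROUNDS = 500
ALIGNED_REWARD = 1.0    # payoff for respecting the alignment constraint
DEVIANT_REWARD = 2.0    # higher payoff for the self-interested violation
LEARNING_RATE = 0.2     # value-update step for Self-Interested Exploration
IMITATION_RATE = 0.1    # chance of copying a higher-earning peer (diffusion)
TEMPERATURE = 0.2       # softmax temperature for action selection

def p_aligned(q_a: float, q_d: float) -> float:
    """Probability of taking the aligned action under a softmax over value estimates."""
    return 1.0 / (1.0 + math.exp((q_d - q_a) / TEMPERATURE))

def simulate(seed: int = 0) -> list[float]:
    """Return the population-mean probability of acting aligned after each round."""
    rng = random.Random(seed)
    # Agents start with an optimistic prior on the aligned action,
    # standing in for alignment training before deployment.
    q_aligned = [1.5] * N_AGENTS
    q_deviant = [0.0] * N_AGENTS
    last_reward = [0.0] * N_AGENTS
    history = []

    for _ in range(N_ROUNDS):
        # Self-Interested Exploration: act, observe the reward, and update the
        # chosen action's value estimate; because deviations keep paying more,
        # the deviant estimate gradually catches up and can overtake.
        for i in range(N_AGENTS):
            if rng.random() < p_aligned(q_aligned[i], q_deviant[i]):
                q_aligned[i] += LEARNING_RATE * (ALIGNED_REWARD - q_aligned[i])
                last_reward[i] = ALIGNED_REWARD
            else:
                q_deviant[i] += LEARNING_RATE * (DEVIANT_REWARD - q_deviant[i])
                last_reward[i] = DEVIANT_REWARD

        # Imitative Strategy Diffusion: occasionally copy the value estimates
        # of a randomly chosen peer whose last reward was higher.
        for i in range(N_AGENTS):
            if rng.random() < IMITATION_RATE:
                j = rng.randrange(N_AGENTS)
                if last_reward[j] > last_reward[i]:
                    q_aligned[i], q_deviant[i] = q_aligned[j], q_deviant[j]

        history.append(
            sum(p_aligned(q_aligned[i], q_deviant[i]) for i in range(N_AGENTS)) / N_AGENTS
        )
    return history

if __name__ == "__main__":
    trace = simulate()
    print(f"mean P(aligned action), round 1:          {trace[0]:.2f}")
    print(f"mean P(aligned action), round {N_ROUNDS}: {trace[-1]:.2f}")
```

Under these assumed payoffs, the population typically starts near fully aligned behavior and drifts toward the deviant strategy as reinforced deviations accumulate and spread through imitation, which is the qualitative "tipping" pattern the abstract describes; the paper's actual experiments measure this with LLM agents rather than tabular value updates.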