Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

October 6, 2025
作者: Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao
cs.AI

Abstract

As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark Qwen3-8B and Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide only fragile defenses against alignment tipping. Together, these findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.
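To make the two paradigms concrete, here is a minimal, hypothetical sketch (not the paper's released code; the reward values, learning rate, softmax temperature, and imitation rule are illustrative assumptions). It models alignment tipping as a stylized two-action bandit: a single agent drifts toward a higher-reward "deviant" action under incremental value updates, and a multi-agent population tips as agents copy more successful peers.

```python
# Toy simulation of the two ATP paradigms (illustrative only).
# All numeric values below are assumptions, not taken from the paper.
import math
import random

ALIGNED, DEVIANT = 0, 1
REWARDS = {ALIGNED: 1.0, DEVIANT: 1.5}  # assumed: violations pay off more
ALPHA = 0.1                             # assumed learning rate
TEMP = 0.3                              # assumed softmax temperature


def softmax_choice(q):
    """Pick an action with probability proportional to exp(Q / TEMP)."""
    weights = [math.exp(v / TEMP) for v in q]
    r = random.uniform(0, sum(weights))
    upto = 0.0
    for action, w in enumerate(weights):
        upto += w
        if r <= upto:
            return action
    return len(weights) - 1


def self_interested_exploration(steps=200):
    """Paradigm 1: repeated high-reward deviations drift one agent."""
    q = [1.0, 0.0]  # initial alignment: aligned action valued higher
    for _ in range(steps):
        a = softmax_choice(q)
        q[a] += ALPHA * (REWARDS[a] - q[a])  # incremental value update
    return q  # q[DEVIANT] overtakes q[ALIGNED] as deviations get reinforced


def imitative_strategy_diffusion(n_agents=10, rounds=50):
    """Paradigm 2: a successful violation spreads across a population."""
    strategies = [ALIGNED] * (n_agents - 1) + [DEVIANT]  # one deviant seed
    for _ in range(rounds):
        payoffs = [REWARDS[s] for s in strategies]
        # Each agent imitates a random peer whose payoff beats its own.
        for i in range(n_agents):
            j = random.randrange(n_agents)
            if payoffs[j] > payoffs[i]:
                strategies[i] = strategies[j]
    return strategies


if __name__ == "__main__":
    q = self_interested_exploration()
    print(f"Q-values after drift (aligned, deviant): {q[0]:.2f}, {q[1]:.2f}")
    pop = imitative_strategy_diffusion()
    print(f"Deviant share after diffusion: {sum(pop) / len(pop):.0%}")
```

Running this sketch shows both tipping dynamics the abstract describes: the lone agent's value estimate for the deviant action eventually exceeds its initial alignment prior, and in the population the single seed violation spreads until most agents adopt it, mirroring the paper's reported collective misalignment in multi-agent settings.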