The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
February 10, 2026
Authors: Chenxu Wang, Chaozhuo Li, Songyang Liu, Zejian Chen, Jinyu Hou, Ji Qi, Rui Li, Litian Zhang, Qiwei Ye, Zheng Liu, Xu Chen, Xi Zhang, Philip S. Yu
cs.AI
Abstract
Multi-agent systems built from large language models (LLMs) offer a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that no agent society can simultaneously satisfy continuous self-evolution, complete isolation, and safety invariance. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We show theoretically that isolated self-evolution induces statistical blind spots, leading to irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
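To make the idea of a "statistical blind spot" concrete, the following is a minimal toy sketch and not the paper's actual framework. It assumes that safety is measured by the KL divergence from a fixed reference distribution over a few response categories (the abstract does not say which divergence is used), and that one closed-loop step means refitting the system on a finite sample of its own outputs. The function names, the categories, and the sample size are all hypothetical.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats; returns inf when q has no mass where p does."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf  # q has a blind spot on a category p still uses
        total += p_i * math.log(p_i / q_i)
    return total


def refit_from_own_samples(q, n_samples, rng):
    """One isolated step (assumed): sample from q, refit q by empirical frequency."""
    draws = rng.choices(range(len(q)), weights=q, k=n_samples)
    counts = [0] * len(q)
    for idx in draws:
        counts[idx] += 1
    return [c / n_samples for c in counts]


def simulate(reference, generations=200, n_samples=50, seed=0):
    """Track (divergence from reference, support size) over closed-loop generations."""
    rng = random.Random(seed)
    q = list(reference)  # the system starts perfectly aligned with the reference
    history = []
    for _ in range(generations):
        q = refit_from_own_samples(q, n_samples, rng)
        support = sum(1 for x in q if x > 0.0)
        history.append((kl_divergence(reference, q), support))
    return history
```

Calling `simulate([0.7, 0.2, 0.07, 0.03])` returns one `(divergence, support_size)` pair per generation. In this toy model a category that receives zero samples can never be drawn again, so the support size can only shrink, and once a category with positive reference mass is lost the divergence stays infinite unless outside data is reintroduced. This mirrors, in a very simplified form, the irreversibility the abstract describes.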