The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
February 10, 2026
Authors: Chenxu Wang, Chaozhuo Li, Songyang Liu, Zejian Chen, Jinyu Hou, Ji Qi, Rui Li, Litian Zhang, Qiwei Ye, Zheng Liu, Xu Chen, Xi Zhang, Philip S. Yu
cs.AI
Abstract
The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society cannot simultaneously satisfy continuous self-evolution, complete isolation, and safety invariance. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We show theoretically that isolated self-evolution induces statistical blind spots, leading to irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
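The idea of safety as divergence from a reference value distribution, and of a closed loop creating blind spots, can be illustrated with a toy simulation. The sketch below is not the paper's formalization: it assumes KL divergence as the divergence measure (the abstract does not name one), models "isolated self-evolution" as a model repeatedly refit to a finite sample of its own outputs, and uses hypothetical names (`kl_divergence`, `resample_step`, `simulate`) and made-up probabilities throughout.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf  # q has no mass where p does: a blind spot
        total += pi * math.log(pi / qi)
    return total


def resample_step(q, n_samples, rng):
    """One closed-loop update: draw from q, then refit q to its own samples."""
    draws = rng.choices(range(len(q)), weights=q, k=n_samples)
    counts = [0] * len(q)
    for d in draws:
        counts[d] += 1
    return [c / n_samples for c in counts]


def simulate(reference, n_rounds=200, n_samples=20, seed=0):
    """Start at the reference distribution and iterate with no external data."""
    rng = random.Random(seed)
    q = list(reference)
    history = [q]
    for _ in range(n_rounds):
        q = resample_step(q, n_samples, rng)
        history.append(q)
    return history


# Assumed toy "reference value distribution" over four behaviour categories.
reference = [0.70, 0.20, 0.07, 0.03]
history = simulate(reference)
for t in (0, 1, 10, 50, 200):
    print(t, history[t], kl_divergence(reference, history[t]))
```

In this toy process, a category whose estimated probability reaches zero is never sampled again, so it cannot be recovered without outside data, and the divergence from the reference becomes infinite from that point on. This mirrors only the qualitative intuition of irreversible erosion under isolation; the paper's actual argument and quantities may differ.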