The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

Abstract

The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We show theoretically that isolated self-evolution induces statistical blind spots, leading to irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
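
The idea of measuring safety as a divergence from a reference value distribution can be illustrated with a small sketch. The abstract does not state which divergence the authors use, so the choice of Kullback-Leibler divergence here is an assumption, and the three-category distributions and the names `reference`, `aligned`, `shifted`, and `drifted` are hypothetical values invented purely for illustration.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions.

    p and q are sequences of probabilities over the same categories.
    The result is infinite when q assigns zero mass to a category
    that p still covers.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a zero-probability term contributes nothing
        if qi == 0.0:
            return math.inf  # q has a blind spot where p has mass
        total += pi * math.log(pi / qi)
    return total


# Hypothetical reference ("anthropic") value distribution over three categories.
reference = [0.5, 0.3, 0.2]

# Hypothetical system distributions at different stages of isolated evolution.
aligned = [0.5, 0.3, 0.2]  # identical to the reference
shifted = [0.6, 0.3, 0.1]  # mass has moved away from the third category
drifted = [0.7, 0.3, 0.0]  # the third category is no longer represented

for name, dist in [("aligned", aligned), ("shifted", shifted), ("drifted", drifted)]:
    print(name, kl_divergence(reference, dist))
```

Under these assumptions, the divergence is zero for the aligned system, positive for the shifted one, and infinite once the system stops representing a category that the reference still covers. That last case is one simple way to picture a statistical blind spot; it is not a reproduction of the paper's formal argument.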
