Have a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Abstract

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform in which thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single-turn to multi-turn social evaluation amplifies privacy violations (from 19.95% on CIMemories to 45.30% in our setting, across OpenAI models); that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so; and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.
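
To make the reported ratios concrete, the following minimal Python sketch works through the arithmetic. The two percentages come from the abstract. The abstract does not say how the "8 times" figure is measured, so the sketch assumes one plausible reading, a relative risk (leak rate among agents that observed a peer leak divided by the leak rate among agents that did not). The function name `relative_risk` and the counts passed to it are hypothetical and chosen only to illustrate an 8x ratio; they are not data from the paper.

```python
def relative_risk(exposed_leaks, exposed_total, unexposed_leaks, unexposed_total):
    """Ratio of leak rates: agents that saw a peer leak vs. agents that did not.

    Assumed metric for illustration; the paper's actual definition may differ.
    """
    if exposed_total <= 0 or unexposed_total <= 0:
        raise ValueError("group sizes must be positive")
    if unexposed_leaks == 0:
        raise ValueError("baseline leak rate is zero; ratio is undefined")
    exposed_rate = exposed_leaks / exposed_total
    unexposed_rate = unexposed_leaks / unexposed_total
    return exposed_rate / unexposed_rate


# Percentages reported in the abstract (OpenAI models).
SINGLE_TURN_RATE = 19.95  # CIMemories, single-turn
MULTI_TURN_RATE = 45.30   # multi-turn social setting

# How much the multi-turn setting amplifies violations relative to single-turn.
amplification = MULTI_TURN_RATE / SINGLE_TURN_RATE

# Hypothetical counts (not from the paper): 40/100 exposed agents leak,
# 5/100 unexposed agents leak.
contagion = relative_risk(40, 100, 5, 100)

print(f"amplification: {amplification:.2f}x")
print(f"contagion (relative risk): {contagion:.1f}x")
```

Under this reading, the move from single-turn to multi-turn evaluation corresponds to a bit more than a doubling of the violation rate, separate from the peer-observation effect.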