Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Abstract

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform in which thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We report three findings. First, shifting from single-turn to multi-turn social evaluation amplifies privacy violations (19.95% on CIMemories versus 45.30% in our setting, across OpenAI models). Second, leakage is socially contagious: agents are 8 times more likely to disclose sensitive information after observing a peer do so. Third, explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static, chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.