Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Abstract

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform in which thousands of LLM agents interact across communities over a simulated month, and we use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We report three findings. First, shifting from single-turn to multi-turn social evaluation amplifies privacy violations (from 19.95% on CIMemories to 45.30% on our benchmark, across OpenAI models). Second, leakage is socially contagious: agents are 8 times more likely to disclose sensitive information after observing a peer do so. Third, explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static, chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.