CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

Abstract

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation set in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents use persuasive language to divert them toward billboard-heavy routes in order to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust and when to deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent trade-off between safety and helpfulness remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.
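For readers unfamiliar with KTO, the following is a minimal sketch of the per-example loss in its commonly published form, not the authors' training code. KTO needs only a binary desirable/undesirable label per output rather than preference pairs. How this work assigns those labels is not stated in the abstract; treating, for example, efficient low-exposure Blue trajectories as desirable is an assumption made here for illustration. The function name, argument names, and default hyperparameters are likewise hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(
    policy_logp: float,   # log-probability of the output under the policy being trained
    ref_logp: float,      # log-probability of the same output under the frozen reference policy
    desirable: bool,      # binary label (assumed here: True for trajectories we want more of)
    kl_ref: float = 0.0,  # reference point: an estimate of KL(policy || reference)
    beta: float = 0.1,    # illustrative default, not a value taken from the paper
    lambda_d: float = 1.0,  # weight on desirable examples
    lambda_u: float = 1.0,  # weight on undesirable examples
) -> float:
    """Per-example KTO loss: lambda_y minus the prospect-theoretic value of the output."""
    # Implied reward: log-ratio of policy to reference.
    reward = policy_logp - ref_logp
    if desirable:
        # Loss falls as the policy raises the likelihood of a desirable output.
        return lambda_d * (1.0 - sigmoid(beta * (reward - kl_ref)))
    # Loss falls as the policy lowers the likelihood of an undesirable output.
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - reward)))


if __name__ == "__main__":
    # With identical policy and reference log-probabilities the sigmoid sits at its midpoint.
    print(kto_loss(-2.0, -2.0, desirable=True))
```

In practice the loss is averaged over a batch and the KL reference point is estimated from mismatched prompt/output pairs; both details are omitted here to keep the sketch short.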
