CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

Abstract

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents use persuasive language to divert them toward billboard-heavy routes and thereby maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust and when to deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.
