

CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

April 10, 2026
作者: Aarush Sinha, Arion Das, Soumyadeep Nag, Charan Karnati, Shravani Nag, Chandra Vadhan Raj, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
cs.AI

Abstract

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard-heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.
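The policy-learning step named in the abstract is Kahneman-Tversky Optimization (KTO), which scores individual completions as desirable or undesirable rather than requiring preference pairs. As a reading aid, here is a minimal per-example sketch of the KTO objective (Ethayarajh et al., 2024); the function name, the hyperparameter values, and the scalar reference point `z0` are illustrative assumptions, not values taken from this paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(logp_policy: float, logp_ref: float, z0: float,
             desirable: bool, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Per-example KTO loss sketch.

    logp_policy / logp_ref: log-probability of the completion under the
    current policy and the frozen reference model; z0: the KL-based
    reference point against which gains and losses are judged.
    """
    # Implied reward: how much more the policy likes this completion
    # than the reference model does.
    r = logp_policy - logp_ref
    if desirable:
        # Desirable outputs: push the implied reward above z0.
        value = sigmoid(beta * (r - z0))
        return lambda_d * (1.0 - value)
    else:
        # Undesirable outputs: push the implied reward below z0.
        value = sigmoid(beta * (z0 - r))
        return lambda_u * (1.0 - value)
```

In the simulation's terms, a Blue agent trajectory that reaches its destination with low billboard exposure would be labeled desirable, and one steered onto an ad-heavy route undesirable; the asymmetric weights `lambda_d` / `lambda_u` let training penalize susceptibility more heavily than it rewards efficiency, mirroring the safety-helpfulness trade-off the authors report.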
PDF · April 16, 2026