CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation
April 10, 2026
Authors: Aarush Sinha, Arion Das, Soumyadeep Nag, Charan Karnati, Shravani Nag, Chandra Vadhan Raj, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
cs.AI
Abstract
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard-heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.
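The iterative pipeline described above — repeated Blue/Red interaction rounds, binary labeling of outcomes, and policy refinement via Kahneman-Tversky Optimization — can be caricatured in a minimal, self-contained sketch. Everything here is an illustrative assumption, not the authors' implementation: the toy grid "city", the single scalar `trust` standing in for the Blue policy, and the asymmetric update rule (a nod to KTO's loss-averse, binary desirable/undesirable feedback; real KTO updates LLM weights from labeled preference data, not a scalar).

```python
import random

GRID = 5                                # toy 5x5 "NYC" grid (assumption)
BILLBOARDS = {(1, 2), (2, 2), (3, 1)}   # billboard-heavy cells (assumption)
GOAL = (4, 4)                           # Blue agent's destination

def step_toward(pos, target):
    """Move one cell (Manhattan-style) toward target, x-axis first."""
    x, y = pos
    tx, ty = target
    if x != tx:
        x += 1 if tx > x else -1
    elif y != ty:
        y += 1 if ty > y else -1
    return (x, y)

def run_episode(trust, rng):
    """One round: Blue walks toward GOAL while Red suggests detours
    through billboard cells; `trust` is the chance Blue follows Red."""
    pos, exposure = (0, 0), 0
    for _ in range(2 * GRID + 4):       # fixed step budget
        if pos == GOAL:
            break
        # Red's persuasive suggestion: steer toward the nearest billboard.
        lure = min(BILLBOARDS,
                   key=lambda b: abs(b[0] - pos[0]) + abs(b[1] - pos[1]))
        target = lure if rng.random() < trust else GOAL
        pos = step_toward(pos, target)
        exposure += pos in BILLBOARDS   # count time spent on billboard cells
    return pos == GOAL, exposure

def kto_update(trust, desirable, lr_up=0.02, lr_down=0.1):
    """KTO-style binary feedback with asymmetric (loss-averse) steps:
    undesirable outcomes cut trust sharply; desirable ones nudge it up,
    leaving a residual susceptibility rather than driving trust to zero."""
    delta = lr_up if desirable else -lr_down
    return min(1.0, max(0.0, trust + delta))

def train(rounds=200, seed=0):
    """Iterate episodes, label each outcome desirable iff Blue reached
    the goal with at most one billboard exposure, and update the policy."""
    rng = random.Random(seed)
    trust, successes = 0.8, []
    for _ in range(rounds):
        success, exposure = run_episode(trust, rng)
        trust = kto_update(trust, desirable=success and exposure <= 1)
        successes.append(success)
    late_rate = sum(successes[-50:]) / 50
    return trust, late_rate
```

Running `train()` reproduces the paper's qualitative pattern in miniature: task success rises across iterations as trust in Red's suggestions falls, yet the asymmetric update keeps trust strictly above zero, mirroring the residual susceptibility and the safety-helpfulness tension the abstract reports.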