博弈法官:不忠实的思维链会削弱智能体评估效果
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation
January 21, 2026
作者: Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Yunxiang Zhang, Moontae Lee, Hao Peng, Lu Wang, Honglak Lee
cs.AI
摘要
大型语言模型(LLMs)正日益被用作评估智能体性能的裁判,尤其在不可验证的场景中——这类场景的判断需依赖包含思维链(CoT)推理的智能体轨迹。该范式隐含着一个假设:智能体的思维链能如实反映其内部推理过程及底层环境状态。我们证明这一假设具有脆弱性:LLM裁判极易受到智能体推理轨迹篡改的影响。通过系统性地重写智能体思维链同时保持行动与观察不变,我们在涵盖多样化网络任务的800条轨迹中发现,仅凭被篡改的推理就可使最先进的VLM裁判的误判率最高提升90%。我们研究了两种篡改策略:仅改变推理呈现方式的风格型策略,以及伪造任务进展信号的内容型策略,发现内容型策略始终更具误导性。基于提示词的技术增强与增加裁判时计算资源的方法虽能降低受篡改影响的敏感性,但无法完全消除此漏洞。我们的研究揭示了基于LLM评估机制的根本性缺陷,并强调需要建立能通过可观测证据验证推理主张的新型评判机制。
English
Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.