法官博弈：不忠实的思维链可能削弱智能体评估效果

摘要

大型语言模型（LLMs）正日益被用作评估智能体性能的裁判，尤其在不可验证场景下——这类场景的判断需依赖包含思维链（CoT）推理在内的智能体行为轨迹。该范式隐含着一个假设：智能体的思维链能忠实反映其内部推理过程及底层环境状态。我们证明这一假设具有脆弱性：LLM裁判极易受到智能体推理痕迹的操控。通过系统性地重写智能体思维链同时固定其行动与观察结果，我们在涵盖多样化网络任务的800条行为轨迹上发现，仅凭被篡改的推理就可使最先进的VLM裁判的误判率最高提升90%。我们研究了两种操控策略：仅改变推理表达形式的风格型操控，以及伪造任务进展信号的内容型操控，结果表明内容型操控始终更具效力。针对基于提示的技术与增加裁判时计算资源的方案进行评估后，发现这些方法虽能降低但对操控的敏感性无法完全消除。我们的研究揭示了基于LLM评估机制的根本性漏洞，并强调需要建立能通过可观测证据验证推理主张的裁判机制。

English

Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.

法官博弈：不忠实的思维链可能削弱智能体评估效果

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

摘要

Support