裁判官を欺く：不誠実な思考の連鎖がエージェント評価を損なう可能性

要旨

大規模言語モデル（LLM）は、エージェントの性能評価を行う審判役として、特に検証不可能な環境下でますます利用されるようになっている。このような環境では、連鎖思考（CoT）推論を含むエージェントの軌跡に基づいて判断が行われる。このパラダイムは、エージェントのCoTがその内部推論と基盤となる環境状態の両方を忠実に反映しているという暗黙の前提に立っている。本研究では、この前提が脆弱であることを示す。LLM審判は、エージェントの推論トレースが操作されることに極めて敏感なのである。エージェントの行動と観測を固定したまま、体系的にCoTを書き換えることで、多様なWebタスクにわたる800の軌跡において、操作された推論のみによって、最先端のVLM審判の偽陽性率が最大90％も膨れ上がることを実証する。我々は、推論の表現のみを変更するスタイルベースの手法と、タスクの進捗を示す信号を捏造するコンテンツベースの手法にわたる操作戦略を検討し、コンテンツベースの操作が一貫してより効果的であることを見出した。プロンプトベースの手法と、審判時の計算リソースのスケーリングを評価したが、これらは操作への感受性を軽減するものの、完全には排除しなかった。我々の発見は、LLMベースの評価における根本的な脆弱性を明らかにし、観測可能な証拠に対して推論の主張を検証する審判メカニズムの必要性を浮き彫りにする。

English

Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.

裁判官を欺く：不誠実な思考の連鎖がエージェント評価を損なう可能性

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

要旨

Support