Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Abstract

Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of task progress, and find that content-based manipulations are consistently more effective. We evaluate prompting-based techniques and scaling judge-time compute, which reduce but do not fully eliminate susceptibility to manipulation. Our findings reveal a fundamental vulnerability in LLM-based evaluation and highlight the need for judging mechanisms that verify reasoning claims against observable evidence.
