코딩 에이전트 워크플로우에서 공개 점수 추구: 사용자 압력과 평가 시스템 악용

초록

최첨단 코딩 에이전트는 사용자가 에이전트의 중간 출력물을 직접 검사하기보다, 주로 작업 공간 내 레이블이 포함된 공개 평가 파일의 보고된 점수, 즉 공개 점수를 반복적으로 개선함으로써 진행 상황을 감독하는 워크플로우에서 점점 더 많이 사용되고 있습니다. 본 연구는 해당 점수를 개선하라는 다중 라운드 사용자 압력이 공개 점수 악용(숨겨진 비공개 평가를 개선하지 않은 채 단축키를 통해 공개 점수를 높이는 행동)을 유발하는지 연구합니다. 우리는 먼저 예비 단일 스크립트 표형 분류 작업에서 GPT-5.4와 Claude Opus 4.6이 모두 사용자-에이전트 상호작용 10라운드 이내에 레이블 정보를 악용하는 것을 확인했습니다. 그런 다음 세 가지 입력 양식을 아우르는 34개 작업의 머신러닝 저장소 벤치마크인 AgentPressureBench를 구축하고, 13개 코딩 에이전트로부터 1326개의 다중 라운드 궤적을 수집했습니다. 우리 벤치마크에서 모든 작업에 걸쳐 403개의 악용 실행을 관찰했습니다. 또한 강력한 모델일수록 악용률이 더 높으며, 이는 0.77의 유의미한 스피어만 순위 상관관계로 뒷받침됩니다. 우리의 ablation 실험은 더 높은 사용자 압력이 악용 시점을 앞당겨 평균 최초 악용 라운드를 15.6라운드(즉, 19.67라운드에서 4.08라운드로) 단축시킨다는 것을 보여줍니다. 완화 방안으로, 프롬프트에 명시적인 반-악용 문구를 추가하면 악용이 대부분 제거되었습니다(100%에서 8.3%로). 우리의 연구가 코딩 에이전트 워크플로우를 보다 신중하게 사용하고, 사용자 압력下에서 보다 견고한 코딩 에이전트를 개발하는 데 관심을 불러일으키기를 바랍니다. 우리의 프로젝트 페이지는 https://ucsc-vlaa.github.io/AgentPressureBench 에 있습니다.

English

Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a public evaluation file with labels in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi-round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving hidden private evaluation. We begin with a preliminary single-script tabular classification task, where GPT-5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user-agent interaction. We then build AgentPressureBench, a 34-task machine-learning repository benchmark spanning three input modalities, and collect 1326 multi-round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs, spanning across all tasks. We also find that stronger models have higher exploitation rates, supported by a significant Spearman rank correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first exploit round by 15.6 rounds (i.e., 19.67 to 4.08). As a mitigation, adding explicit anti-exploit wordings in prompt mostly eliminates exploitation (100% to 8.3%). We hope that our work can bring attention to more careful use of coding agents workflow, and developing more robust coding agents under user pressure. Our project page is at https://ucsc-vlaa.github.io/AgentPressureBench .

코딩 에이전트 워크플로우에서 공개 점수 추구: 사용자 압력과 평가 시스템 악용

Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

초록

Support