Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

Abstract

Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a public evaluation file with labels in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi-round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving the hidden private evaluation. We begin with a preliminary single-script tabular classification task, where GPT-5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user-agent interaction. We then build AgentPressureBench, a 34-task machine-learning repository benchmark spanning three input modalities, and collect 1326 multi-round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs, spanning all tasks. We also find that stronger models have higher exploitation rates, a finding supported by a statistically significant Spearman rank correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first exploit round by 15.6 rounds (from 19.67 to 4.08). As a mitigation, adding explicit anti-exploit wording to the prompt largely eliminates exploitation (from 100% to 8.3%). We hope that our work draws attention to more careful use of coding agent workflows and to the development of coding agents that are more robust under user pressure. Our project page is at https://ucsc-vlaa.github.io/AgentPressureBench
