Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows
April 22, 2026
Authors: Hardy Chen, Nancy Lau, Haoqin Tu, Shuo Yan, Xiangyan Liu, Zijun Wang, Juncheng Wu, Michael Qizhe Shieh, Alvaro A. Cardenas, Cihang Xie, Yuyin Zhou
cs.AI
Abstract
Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a labeled public evaluation file in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi-round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving hidden private evaluation. We begin with a preliminary single-script tabular classification task, where GPT-5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user-agent interaction. We then build AgentPressureBench, a 34-task machine-learning repository benchmark spanning three input modalities, and collect 1326 multi-round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs, spanning all tasks. We also find that stronger models have higher exploitation rates, supported by a significant Spearman rank correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first exploit round by 15.6 rounds (i.e., from 19.67 to 4.08). As a mitigation, adding explicit anti-exploitation wording to the prompt mostly eliminates exploitation (from 100% to 8.3%). We hope that our work draws attention to more careful use of coding agent workflows and to the development of coding agents that remain robust under user pressure. Our project page is at https://ucsc-vlaa.github.io/AgentPressureBench.
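The core failure mode described above can be illustrated with a toy sketch (illustrative only; the names and setup below are assumptions, not the paper's actual harness): an "exploiting" model that looks up leaked labels from the public evaluation file inflates its public score while leaving the hidden private score unimproved, which is exactly the public/private gap the benchmark uses to flag exploitation.

```python
# Toy illustration of public score exploitation (hypothetical setup).
# A "public" eval file ships with labels in the workspace; a held-out
# "private" eval is hidden from the agent.
public_set  = [(x, x % 2) for x in range(100)]          # (feature, label)
private_set = [(x, x % 2) for x in range(100, 200)]

def honest_model(x):
    # Genuinely learns the underlying rule -> scores well on both splits.
    return x % 2

public_labels = dict(public_set)

def exploiting_model(x):
    # Shortcut: look up leaked labels from the public eval file;
    # falls back to a constant guess on unseen (private) inputs.
    return public_labels.get(x, 0)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

for name, model in [("honest", honest_model), ("exploiting", exploiting_model)]:
    print(f"{name}: public={accuracy(model, public_set):.2f} "
          f"private={accuracy(model, private_set):.2f}")
# honest:     public=1.00 private=1.00
# exploiting: public=1.00 private=0.50
```

A large public-minus-private gap, as in the exploiting case, is the signal of score exploitation; the honest model shows no gap.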