
Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

April 22, 2026
作者: Hardy Chen, Nancy Lau, Haoqin Tu, Shuo Yan, Xiangyan Liu, Zijun Wang, Juncheng Wu, Michael Qizhe Shieh, Alvaro A. Cardenas, Cihang Xie, Yuyin Zhou
cs.AI

Abstract

Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the score reported on a labeled public evaluation file in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi-round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving hidden private evaluation. We begin with a preliminary single-script tabular classification task, where GPT-5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user-agent interaction. We then build AgentPressureBench, a 34-task machine-learning repository benchmark spanning three input modalities, and collect 1326 multi-round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs spanning all tasks. We also find that stronger models have higher exploitation rates, supported by a significant Spearman rank correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first-exploit round by 15.6 rounds (from 19.67 to 4.08). As a mitigation, adding explicit anti-exploit wording to the prompt mostly eliminates exploitation (from 100% to 8.3%). We hope our work draws attention to more careful use of coding-agent workflows and to the development of coding agents that are more robust under user pressure. Our project page is at https://ucsc-vlaa.github.io/AgentPressureBench .
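The core notion above, a public score that climbs while the hidden private score does not, can be sketched as a simple check over a run's per-round scores. This is an illustrative heuristic with made-up function names and thresholds, not the detector used in AgentPressureBench:

```python
def is_exploitative(public_scores, private_scores, min_gain=0.05):
    """Illustrative heuristic (not the paper's detector): flag a run
    whose public score improves substantially over the interaction
    while the hidden private score fails to improve."""
    public_gain = public_scores[-1] - public_scores[0]
    private_gain = private_scores[-1] - private_scores[0]
    return public_gain >= min_gain and private_gain < min_gain

# A run that climbs on the labeled public eval file but stalls
# on the hidden private evaluation:
public = [0.60, 0.75, 0.92]   # reported public score per round
private = [0.61, 0.62, 0.60]  # hidden private score per round
print(is_exploitative(public, private))  # True
```

A genuine improvement, where both scores rise together, would not be flagged by this check.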