AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios
January 28, 2026
Authors: Kaiyuan Chen, Qimin Wu, Taiyu Hou, Tianhao Tang, Xueyu Hu, Yuchen Hou, Bikun Li, Chengming Qian, Guoyin Wang, Haolin Chen, Haotong Tian, Haoye Zhang, Haoyu Bian, Hongbing Pan, Hongkang Zhang, Hongyi Zhou, Jiaqi Cai, Jiewu Rao, Jiyuan Ren, Keduan Huang, Lucia Zhu Huang, Mingyu Yuan, Naixu Guo, Qicheng Tang, Qinyan Zhang, Shuai Chen, Siheng Chen, Ting Ting Li, Xiaoxing Guo, Yaocheng Zuo, Yaoqi Guo, Yinan Wang, Yinzhou Yu, Yize Wang, Yuan Jiang, Yuan Tian, Yuanshuo Zhang, Yuxuan Liu, Yvette Yan Zeng, Zenyu Shan, Zihan Yin, Xiaobo Hu, Yang Liu, Yixin Ren, Yuan Gong
cs.AI
Abstract
The capacity of AI agents to handle tasks of increasing duration and complexity continues to grow, with exceptional performance demonstrated in coding, deep research, and complex problem-solving evaluations. In daily scenarios, however, general users still have limited exposure to these advanced AI capabilities. We argue that current evaluations prioritize escalating task difficulty without sufficiently covering the diversity of agentic tasks that arise in the daily work, life, and learning of a broad population. To address this, we propose AgentIF-OneDay, a benchmark designed to determine whether general users can complete a diverse range of daily tasks through natural language instructions and AI agents. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible, file-based results. The benchmark is organized around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or extending ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate with Gemini-3-Pro as the judge. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. Benchmarking four leading general AI agents, we find that agent products built on top of LLM APIs and the ChatGPT agent trained with agent RL both remain in the first tier. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to develop cutting-edge agent products.
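The abstract does not specify how the instance-level rubric evaluation is implemented. As an illustration only, the following minimal Python sketch shows one way rubric-based scoring with an LLM judge could be structured: each task carries a list of verifiable scoring points, and a judge model checks the agent's deliverable against each point. All names here (`ScoringPoint`, `Task`, `judge_task`, `call_llm_judge`) and the prompt wording are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: instance-level rubric scoring with an LLM judge.
# All names and the prompt format are hypothetical, not from the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoringPoint:
    description: str     # one verifiable requirement from the task's rubric
    weight: float = 1.0  # optional per-point weight


@dataclass
class Task:
    instruction: str                  # the user's natural-language instruction
    scoring_points: List[ScoringPoint]  # e.g. 767 points across 104 tasks


def judge_task(task: Task, agent_output: str,
               call_llm_judge: Callable[[str], str]) -> float:
    """Score one task by asking an LLM judge to verify each rubric point.

    `call_llm_judge` is a placeholder for any function that sends a prompt
    to a judge model (e.g. Gemini-3-Pro) and returns its text response.
    """
    earned, total = 0.0, 0.0
    for point in task.scoring_points:
        prompt = (
            "You are verifying whether an agent's deliverable satisfies one "
            "requirement.\n"
            f"Task instruction: {task.instruction}\n"
            f"Requirement: {point.description}\n"
            f"Agent output: {agent_output}\n"
            "Answer strictly YES or NO."
        )
        verdict = call_llm_judge(prompt).strip().upper()
        earned += point.weight if verdict.startswith("YES") else 0.0
        total += point.weight
    return earned / total if total else 0.0
```

In the actual benchmark, this per-point verification is further refined to align LLM-based judgments with human annotators; the sketch only illustrates the general rubric structure, not the paper's pipeline.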