HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
April 29, 2026
Authors: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu
cs.AI
Abstract
Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode: they supply unambiguous, detailed instructions and reward only execution correctness, so an agent that luckily guesses a missing requirement scores identically to one that asks to be certain.
We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this skill of selective escalation. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure prevents gaming the benchmark through question spam by construction.
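To make the metric concrete, here is a minimal sketch of Ask-F1 under the definition above. How questions are matched to blockers (the `questions_on_blockers` count) is an assumption for illustration, not the paper's exact matching protocol.

```python
# Minimal sketch of the Ask-F1 metric as defined in the abstract.
# Assumption: a question counts as a "hit" when human judges match it
# to one of the validated blockers planted in the task.

def ask_f1(questions_asked: int, questions_on_blockers: int, total_blockers: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    questions_asked:       all clarifying questions the agent raised
    questions_on_blockers: questions that hit a validated blocker
    total_blockers:        human-validated blockers in the task
    """
    if questions_asked == 0 or total_blockers == 0:
        return 0.0
    precision = questions_on_blockers / questions_asked  # penalizes over-asking
    recall = questions_on_blockers / total_blockers      # penalizes silent guessing
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Question spam collapses precision, so spamming scores worse than
# asking a few targeted questions that cover the same blockers:
assert ask_f1(50, 3, 3) < ask_f1(4, 3, 3)
```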
Evaluation across SWE and text-to-SQL domains reveals a universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when it must decide for itself whether to ask. Failure analysis identifies three recurring help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; and broad, imprecise escalation without self-correction. The consistency of these patterns confirms that poor help-seeking is a model-level flaw, not a task-specific one. RL training on a shaped Ask-F1 reward shows that judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
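For illustration, a hypothetical shaping of the training signal described above; the linear mix and the `alpha` weight are assumptions, not the paper's actual recipe.

```python
# Hypothetical shaped reward combining task success with Ask-F1.
# The linear combination and alpha=0.5 default are illustrative
# assumptions; the paper only states that the reward is shaped by Ask-F1.

def shaped_reward(task_passed: bool, ask_f1_score: float, alpha: float = 0.5) -> float:
    """Reward an episode for both solving the task and asking well."""
    return (1.0 - alpha) * float(task_passed) + alpha * ask_f1_score
```

Under a mix like this, an agent is rewarded neither for silent lucky guesses (low Ask-F1) nor for question spam (low precision drags Ask-F1 down), matching the tension the benchmark is built to measure.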