

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

April 29, 2026
作者: Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo, Nathan Hunt, Ernesto Montoya, Nandan Marwaha, Yannis He, Charles Wang, Fernando Crabedo, Alessa Castilo, Bing Liu
cs.AI

Abstract

Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous, detailed instructions and reward only execution correctness, so an agent that makes a lucky guess about a missing requirement scores identically to one that asks to be certain. We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam. Evaluation across SWE and text-to-SQL domains reveals a large, universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; and broad, imprecise escalation without self-correction. These consistent patterns confirm that poor help-seeking is a model-level flaw, not a task-specific one. RL training on a shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.
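The abstract defines Ask-F1 as the harmonic mean of question precision and blocker recall. A minimal sketch of how such a metric could be computed, assuming simple counts: here each relevant question is assumed to resolve one distinct blocker, which is a simplification of however the benchmark actually matches questions to blockers.

```python
def ask_f1(num_questions: int, num_relevant: int, num_blockers: int) -> float:
    """Harmonic mean of question precision and blocker recall (illustrative).

    num_questions: total clarifying questions the agent asked
    num_relevant:  questions that address a real, human-validated blocker
    num_blockers:  total blockers planted in the task
    """
    if num_questions == 0 or num_blockers == 0:
        return 0.0
    precision = num_relevant / num_questions   # penalizes question spam
    recall = num_relevant / num_blockers       # penalizes silent guessing
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Spamming ten questions to hit two blockers scores worse than
# asking exactly the two relevant questions:
spam = ask_f1(num_questions=10, num_relevant=2, num_blockers=2)
precise = ask_f1(num_questions=2, num_relevant=2, num_blockers=2)
```

Because precision sits in the harmonic mean, flooding the user with questions drags the score down even when every blocker is eventually covered, which is the anti-gaming property the abstract describes.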