

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

March 27, 2026
作者: Nicholas Edwards, Sebastian Schuster
cs.AI

Abstract

As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolution rate, significantly outperforming a standard single-agent setup (61.20%) and substantially closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.
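The decoupling the abstract describes — a detector that estimates how underspecified an instruction is, gating whether the coding agent executes or first asks clarifying questions — can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`assess_underspecification`, `clarify_or_execute`, the vague-phrase heuristic standing in for an LLM-based detector, and the 0.5 threshold) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClarificationDecision:
    ask: bool                              # True -> route to the user, False -> execute
    questions: list[str] = field(default_factory=list)


def assess_underspecification(instruction: str) -> float:
    """Toy stand-in for the detector agent's uncertainty score in [0, 1].

    A real scaffold would prompt an LLM to judge missing context; here we
    count vague phrases so the control flow is runnable end to end.
    """
    vague_markers = ("somehow", "doesn't work", "fix it", "something", "etc")
    hits = sum(marker in instruction.lower() for marker in vague_markers)
    return min(1.0, hits / 2)


def clarify_or_execute(instruction: str, threshold: float = 0.5) -> ClarificationDecision:
    """Decoupled scaffold: only hand off to the coding agent when the
    estimated underspecification falls below the threshold; otherwise
    emit clarifying questions instead of guessing."""
    score = assess_underspecification(instruction)
    if score >= threshold:
        return ClarificationDecision(ask=True, questions=[
            "Which file or module exhibits the failure?",
            "What is the expected behavior after the fix?",
        ])
    return ClarificationDecision(ask=False)


print(clarify_or_execute("The parser somehow doesn't work, fix it").ask)        # True
print(clarify_or_execute("Raise ValueError in parse() on empty input").ask)     # False
```

The threshold is what the paper's calibration finding speaks to: a well-calibrated detector keeps `ask=False` on simple, fully specified tasks (conserving user queries) and crosses the threshold mainly on genuinely ambiguous ones.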