Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-Turn Clarification

Abstract

Referring segmentation aims to segment target objects in images or videos based on a textual query. Despite remarkable progress in recent years, existing work typically assumes that user-provided queries are already precise and clear. This assumption is impractical: in real-world scenarios, it is unrealistic to expect every user to thoroughly review the visual content and ensure that the query is unique and unambiguous. When faced with an ambiguous query, existing segmentation models tend to guess the user's preference arbitrarily, often producing undesired results. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.