Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-Turn Clarification

Abstract

Referring segmentation aims to segment target objects in images or videos based on a textual query. Despite remarkable progress in recent years, existing work typically assumes that user-provided queries are already precise and unambiguous. This assumption rarely holds in practice: it is unrealistic to expect every user to review their visual content thoroughly and verify that each query identifies a unique target. When a query is ambiguous, existing segmentation models tend to guess the user's intent arbitrarily, often producing undesired results. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmenting. To incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin on ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.