Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
February 13, 2026
Authors: Aadarsh Sahoo, Georgia Gkioxari
cs.AI
Abstract
Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/