UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
December 25, 2023
作者: Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo
cs.AI
Abstract
The reference-based object segmentation tasks, namely referring image
segmentation (RIS), few-shot image segmentation (FSS), referring video object
segmentation (RVOS), and video object segmentation (VOS), aim to segment a
specific object by utilizing either language or annotated masks as references.
Despite significant progress in each field, current methods are designed for
specific tasks and developed in divergent directions, which hinders multi-task
capability across these tasks. In this work, we end this fragmentation and
propose UniRef++ to unify the four
reference-based object segmentation tasks with a single architecture. At the
heart of our approach is the proposed UniFusion module, which performs
multiway fusion to handle different tasks according to their specified
references. A unified Transformer architecture is then adopted to achieve
instance-level segmentation. With this unified design, UniRef++ can
be jointly trained on a broad range of benchmarks and can flexibly complete
multiple tasks at run-time by specifying the corresponding references. We
evaluate our unified models on various benchmarks. Extensive experimental
results indicate that our proposed UniRef++ achieves state-of-the-art
performance on RIS and RVOS, and performs competitively on FSS and VOS with a
parameter-shared network. Moreover, we showcase that the proposed UniFusion
module can be easily incorporated into the current advanced foundation model
SAM, obtaining satisfactory results with parameter-efficient finetuning. Code
and models are available at https://github.com/FoundationVision/UniRef.
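
The core idea described above, a single fusion module that injects whichever reference is given (language tokens for RIS/RVOS, mask-derived tokens for FSS/VOS) into the visual features, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the class name, dimensions, and the use of a single cross-attention layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UniFusionSketch(nn.Module):
    """Hypothetical sketch of a UniFusion-style module: visual features
    attend to the reference tokens (text tokens or mask-pooled features)
    via one cross-attention layer, so the same weights serve all four
    reference-based segmentation tasks."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # visual:    (B, N, C) flattened image/frame features
        # reference: (B, M, C) language tokens (RIS/RVOS) or
        #            mask-derived tokens (FSS/VOS)
        fused, _ = self.attn(query=visual, key=reference, value=reference)
        # Residual fusion keeps the visual stream intact when the
        # reference is uninformative.
        return self.norm(visual + fused)

# One parameter-shared module; only the reference changes per task.
fusion = UniFusionSketch()
visual = torch.randn(2, 64, 256)    # flattened feature map
lang_ref = torch.randn(2, 10, 256)  # text tokens (RIS/RVOS)
mask_ref = torch.randn(2, 1, 256)   # mask-pooled token (FSS/VOS)
out_lang = fusion(visual, lang_ref)
out_mask = fusion(visual, mask_ref)
```

The fused features would then feed a shared Transformer decoder for instance-level segmentation, with the task selected at run time simply by the kind of reference passed in.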