UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

Abstract

Reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object using either language or annotated masks as references. Despite significant progress in each field, current methods are designed for a single task and have developed in different directions, which hinders multi-task capability across these tasks. In this work, we end this fragmentation and propose UniRef++, which unifies the four reference-based object segmentation tasks in a single architecture. At the heart of our approach is the proposed UniFusion module, which performs multiway fusion to handle different tasks according to their specified references. A unified Transformer architecture is then adopted to achieve instance-level segmentation. With this unified design, UniRef++ can be jointly trained on a broad range of benchmarks and can flexibly complete multiple tasks at run time by specifying the corresponding references. We evaluate our unified models on various benchmarks. Extensive experimental results indicate that UniRef++ achieves state-of-the-art performance on RIS and RVOS, and performs competitively on FSS and VOS with a parameter-shared network. Moreover, we show that the proposed UniFusion module can be easily incorporated into the current advanced foundation model SAM and obtains satisfactory results with parameter-efficient finetuning. Code and models are available at https://github.com/FoundationVision/UniRef.
