Tracking Anything with Decoupled Video Segmentation
September 7, 2023
Authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee
cs.AI
Abstract
Training data for video segmentation are expensive to annotate. This impedes
extensions of end-to-end algorithms to new video segmentation tasks, especially
in large-vocabulary settings. To 'track anything' without training on video
data for every individual task, we develop a decoupled video segmentation
approach (DEVA), composed of task-specific image-level segmentation and
class/task-agnostic bi-directional temporal propagation. Due to this design, we
only need an image-level model for the target task (which is cheaper to train)
and a universal temporal propagation model which is trained once and
generalizes across tasks. To effectively combine these two modules, we use
bi-directional propagation for (semi-)online fusion of segmentation hypotheses
from different frames to generate a coherent segmentation. We show that this
decoupled formulation compares favorably to end-to-end approaches in several
data-scarce tasks including large-vocabulary video panoptic segmentation,
open-world video segmentation, referring video segmentation, and unsupervised
video object segmentation. Code is available at:
https://hkchengrex.github.io/Tracking-Anything-with-DEVA
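To make the decoupled formulation concrete, below is a minimal sketch (not the authors' implementation) of such a pipeline: a task-specific image-level model proposes segments every few frames, while a task-agnostic temporal propagator carries a coherent set of masks forward, and the two are fused whenever new proposals arrive. Names such as `image_model`, `propagator`, and `fuse` are hypothetical placeholders, and the sketch omits the bi-directional in-clip consensus that the paper uses for (semi-)online fusion.

```python
from typing import Callable, Iterable, List

def decoupled_video_segmentation(
    frames: Iterable,          # sequence of video frames
    image_model: Callable,     # frame -> list of proposed masks (task-specific, image-level)
    propagator,                # object with .propagate(frame) and .add_masks(frame, masks)
    fuse: Callable,            # (propagated_masks, proposed_masks) -> fused mask set
    detect_every: int = 5,     # how often to consult the image-level model
) -> List:
    outputs = []
    for t, frame in enumerate(frames):
        # Temporal propagation: carry the existing masks from memory to this frame.
        propagated = propagator.propagate(frame)

        if t % detect_every == 0:
            # Image-level segmentation: task-specific proposals for this frame.
            proposals = image_model(frame)
            # Fusion: reconcile propagated masks with the new proposals
            # (e.g., match overlapping segments, spawn new objects, drop stale ones).
            current = fuse(propagated, proposals)
            # Update the propagator's memory with the fused, coherent result.
            propagator.add_masks(frame, current)
        else:
            current = propagated

        outputs.append(current)
    return outputs
```

Under this decoupling, only `image_model` needs to be trained for a new target task; the propagator and fusion logic are class- and task-agnostic and can be reused unchanged.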