Tracking Anything with Decoupled Video Segmentation

Abstract

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
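
The decoupling described above can be illustrated with a toy sketch. This is not the DEVA implementation: it is a simplified, forward-only loop, assuming masks are sets of pixel coordinates and that fusion is a greedy IoU match. The names `iou`, `fuse`, `track`, `segment_image`, `propagate`, and `detect_every` are hypothetical and chosen for illustration only. The image-level model and the propagation model are passed in as plain callables, which is the point of the design: either one can be swapped without touching the other.

```python
from typing import Callable, Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(propagated: Dict[int, Mask], detections: List[Mask],
         next_id: int, thresh: float = 0.5) -> Tuple[Dict[int, Mask], int]:
    """Greedy IoU fusion (an assumption, not the paper's consensus step).

    A detection that overlaps a propagated segment keeps that segment's ID;
    an unmatched detection starts a new ID; unmatched propagated segments
    are kept unchanged.
    """
    fused = dict(propagated)
    free = set(propagated)  # propagated IDs not yet claimed by a detection
    for det in detections:
        best = max(free, key=lambda i: iou(propagated[i], det), default=None)
        if best is not None and iou(propagated[best], det) >= thresh:
            fused[best] = det
            free.remove(best)
        else:
            fused[next_id] = det
            next_id += 1
    return fused, next_id


def track(frames: list,
          segment_image: Callable[[object], List[Mask]],
          propagate: Callable[[Mask, object], Mask],
          detect_every: int = 2) -> List[Dict[int, Mask]]:
    """Run task-agnostic propagation on every frame and fuse in
    task-specific image-level detections every `detect_every` frames."""
    state: Dict[int, Mask] = {}
    next_id = 0
    outputs = []
    for t, frame in enumerate(frames):
        # Temporal propagation: carry each tracked mask into this frame.
        state = {i: propagate(m, frame) for i, m in state.items()}
        if t % detect_every == 0:
            # Image-level segmentation, fused with the propagated masks.
            state, next_id = fuse(state, segment_image(frame), next_id)
        outputs.append({i: set(m) for i, m in state.items()})
    return outputs
```

In this sketch only `segment_image` would change from task to task, while `propagate` and the fusion loop stay fixed. The bi-directional, (semi-)online fusion named in the abstract is more involved than this forward-only matching.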
