
Tracking Anything with Decoupled Video Segmentation

September 7, 2023
Authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee
cs.AI

Abstract

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
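
To make the decoupled design concrete, below is a minimal, simplified sketch of the idea described in the abstract: a task-specific image-level segmenter is queried only occasionally, and a class/task-agnostic propagation model carries segment identities between those queries, with overlap-based fusion keeping the result coherent. All names here (`track`, `segment_image`, `propagate`, `fuse`, `detection_every`) are illustrative assumptions, not the actual DEVA API, and the sketch omits DEVA's bi-directional propagation (in-clip consensus over future frames) and its spatio-temporal memory; see the project page for the real implementation.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def fuse(propagated: dict, proposals: list, thresh: float = 0.5) -> dict:
    """Merge propagated segments (id -> mask) with new image-level proposals.

    A proposal that overlaps an existing segment refreshes that segment
    (keeping its identity); an unmatched proposal starts a new object.
    """
    if not propagated:
        return {i: m for i, m in enumerate(proposals)}
    fused = dict(propagated)
    next_id = max(fused) + 1
    for prop_mask in proposals:
        best_id, best_iou = None, 0.0
        for obj_id, mask in propagated.items():
            score = iou(prop_mask, mask)
            if score > best_iou:
                best_id, best_iou = obj_id, score
        if best_iou >= thresh:
            fused[best_id] = prop_mask   # matched: keep the old identity
        else:
            fused[next_id] = prop_mask   # unmatched: a new object enters
            next_id += 1
    return fused


def track(frames, segment_image, propagate, detection_every: int = 5) -> list:
    """Decoupled (semi-)online tracking loop.

    `segment_image(frame)` is the task-specific image-level model (returns a
    list of boolean masks); `propagate(frame, segments)` is the task-agnostic
    temporal propagation model (maps the previous id -> mask dict to the
    current frame). Detection runs only every `detection_every` frames.
    """
    segments: dict = {}   # current id -> boolean mask
    results = []
    for t, frame in enumerate(frames):
        propagated = propagate(frame, segments) if segments else {}
        if t % detection_every == 0:
            proposals = segment_image(frame)        # image-level hypotheses
            segments = fuse(propagated, proposals)  # fuse into a coherent result
        else:
            segments = propagated
        results.append(dict(segments))
    return results
```

The point of the sketch is the division of labor: only `segment_image` needs to be trained for the target task (on images), while `propagate` is generic and reused across tasks, which is what makes the approach attractive in data-scarce, large-vocabulary settings.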