Autoregressive Universal Video Segmentation Model

Abstract

Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.
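The link between a fixed-size state and parallel training across frames can be illustrated with a generic linear recurrence, the core of many state-space layers. The sketch below is not AUSM's architecture, which the abstract does not specify. The function names and the scalar parameters `a` and `b` are hypothetical, and each "frame" is reduced to a single number for brevity. It shows only that the same recurrence can be run frame by frame with constant memory, or unrolled so that every time step is computed independently of the others.

```python
from typing import List


def recurrent_scan(xs: List[float], a: float, b: float) -> List[float]:
    """Streaming form: a single fixed-size state h is updated once per frame."""
    h = 0.0  # state size does not grow with the length of the stream
    out = []
    for x in xs:
        h = a * h + b * x  # h_t = a * h_{t-1} + b * x_t
        out.append(h)
    return out


def unrolled_scan(xs: List[float], a: float, b: float) -> List[float]:
    """Unrolled form: h_t = sum over k <= t of a^(t-k) * b * x_k.

    Each h_t depends only on the inputs, not on other outputs, so all
    time steps could be evaluated in parallel during training.
    """
    return [
        sum(a ** (t - k) * b * xs[k] for k in range(t + 1))
        for t in range(len(xs))
    ]


if __name__ == "__main__":
    frames = [1.0, 0.0, 2.0, -1.0]  # toy stand-ins for per-frame features
    streaming = recurrent_scan(frames, a=0.5, b=1.0)
    parallel = unrolled_scan(frames, a=0.5, b=1.0)
    # Both forms produce the same sequence of states.
    assert all(abs(s - p) < 1e-12 for s, p in zip(streaming, parallel))
    print(streaming)  # [1.0, 0.5, 2.25, 0.125]
```

The unrolled form here costs quadratic work in the sequence length. Practical state-space implementations typically use an associative scan or a convolution to obtain the same result more efficiently.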
