VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Abstract

Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can perform accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x to 10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
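
The query propagation and query fusion ideas described above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the abstract does not specify how propagated and learned queries are combined, so the element-wise sum in `fuse_queries` is an assumption, and `toy_encoder`, `fuse_queries`, and `segment_video` are hypothetical names introduced only for illustration. In the real model the encoder would be a ViT that processes image patches and queries jointly.

```python
from typing import Callable, List, Optional

Query = List[float]      # one object query (toy embedding)
Queries = List[Query]    # the full set of queries for one frame


def fuse_queries(propagated: Queries, learned: Queries) -> Queries:
    """Combine previous-frame queries with learned queries.

    ASSUMPTION: fusion is an element-wise sum. The abstract only says the
    two sets are "combined"; the actual operation may differ.
    """
    return [[p + l for p, l in zip(pq, lq)] for pq, lq in zip(propagated, learned)]


def toy_encoder(frame: float, queries: Queries) -> Queries:
    """Hypothetical stand-in for the ViT encoder.

    A frame is reduced to a single number here, and the "encoder" simply
    adds it to every query component so the data flow is easy to follow.
    """
    return [[value + frame for value in query] for query in queries]


def segment_video(
    frames: List[float],
    encoder: Callable[[float, Queries], Queries],
    learned: Queries,
) -> List[Queries]:
    """Process frames online, reusing the previous frame's output queries."""
    outputs: List[Queries] = []
    previous: Optional[Queries] = None
    for frame in frames:
        if previous is None:
            # First frame: nothing to propagate, so use the learned queries.
            queries_in = learned
        else:
            # Later frames: propagate the previous queries and fuse them
            # with the temporally agnostic learned queries.
            queries_in = fuse_queries(previous, learned)
        previous = encoder(frame, queries_in)
        outputs.append(previous)
    return outputs


if __name__ == "__main__":
    learned_queries = [[1.0, 0.0], [0.0, 1.0]]
    for t, out in enumerate(segment_video([1.0, 2.0], toy_encoder, learned_queries)):
        print(f"frame {t}: {out}")
```

The point of the sketch is the control flow: no separate tracker is called, and the only state carried between frames is the set of output queries from the previous frame.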
