VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Abstract

Existing online video segmentation models typically combine a per-frame segmenter with complex, specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, given sufficient capacity and large-scale pre-training, can perform accurate image segmentation without specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this propagation with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x to 10x faster and running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
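
The query propagation and fusion loop described above can be illustrated with a minimal sketch. This is not the released implementation: the names `fuse_queries`, `run_online`, and `toy_encode` are hypothetical, the fusion operator is assumed to be element-wise addition (the abstract only says the two query sets are combined), and the encoder is replaced by a toy function so the example runs on its own.

```python
from typing import Callable, List, Optional

Vector = List[float]
Queries = List[Vector]  # one embedding per object query


def fuse_queries(propagated: Queries, learned: Queries) -> Queries:
    """Combine propagated and learned queries.

    Assumption: element-wise addition. The abstract only states that the
    two sets are combined; the actual fusion operator may differ.
    """
    return [[p + l for p, l in zip(pq, lq)] for pq, lq in zip(propagated, learned)]


def run_online(
    frames: List[float],
    learned: Queries,
    encode: Callable[[float, Queries], Queries],
) -> List[Queries]:
    """Process frames in order, carrying queries from one frame to the next."""
    outputs: List[Queries] = []
    previous: Optional[Queries] = None
    for frame in frames:
        # First frame: nothing to propagate, so start from the learned queries.
        inputs = learned if previous is None else fuse_queries(previous, learned)
        previous = encode(frame, inputs)  # hypothetical stand-in for the ViT encoder
        outputs.append(previous)
    return outputs


def toy_encode(frame: float, queries: Queries) -> Queries:
    """Toy encoder: shifts every query value by the (scalar) frame value."""
    return [[value + frame for value in query] for query in queries]


learned_queries = [[0.0, 1.0], [2.0, 3.0]]
result = run_online([1.0, 10.0], learned_queries, toy_encode)
```

In this toy setting, the queries entering the second frame are the first frame's output queries plus the learned queries, which mirrors the idea of reusing past information while keeping a temporally agnostic component available for new content.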
