

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

February 19, 2026
Authors: Narges Norouzi, Idil Esen Zulfikar, Niccolò Cavagnero, Tommie Kerssies, Bastian Leibe, Gijs Dubbelman, Daan de Geus
cs.AI

Abstract

Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5-10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
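The propagation-plus-fusion loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `fuse_queries`, `encoder_step`, the weighted-sum fusion, and all dimensions are assumptions made for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_QUERIES, DIM = 100, 256

# Temporally-agnostic learned queries (trained parameters in the real model;
# random values here purely for illustration).
learned_queries = rng.standard_normal((NUM_QUERIES, DIM))

def fuse_queries(propagated, learned, alpha=0.5):
    """Hypothetical fusion: blend queries propagated from the previous
    frame with the learned, frame-independent queries. The abstract does
    not specify the fusion operator; a weighted sum is one simple choice."""
    if propagated is None:  # first frame: no history to propagate yet
        return learned.copy()
    return alpha * propagated + (1.0 - alpha) * learned

def encoder_step(frame_feats, queries):
    """Stand-in for the ViT encoder jointly processing patch tokens and
    queries; here just a cheap placeholder update."""
    return queries + 0.01 * frame_feats.mean(axis=0, keepdims=True)

propagated = None
for t in range(3):  # three dummy frames
    frame_feats = rng.standard_normal((196, DIM))  # e.g. 14x14 patch tokens
    queries = fuse_queries(propagated, learned_queries)
    queries = encoder_step(frame_feats, queries)
    propagated = queries  # reuse this frame's queries for the next frame

print(propagated.shape)
```

The key design point the sketch reflects is that temporal modeling costs only a query handoff between frames, with no separate tracking network; the learned queries keep the model able to pick up objects that newly appear.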