Autoregressive Universal Video Segmentation Model
August 26, 2025
Authors: Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu, Seon Joo Kim, Ryo Hachiuma
cs.AI
Abstract
Recent video foundation models such as SAM2 excel at prompted video
segmentation by treating masks as a general-purpose primitive. However, many
real-world settings require unprompted segmentation that aims to detect and
track all objects in a video without external cues, leaving today's landscape
fragmented across task-specific models and pipelines. We recast streaming video
segmentation as sequential mask prediction, analogous to language modeling, and
introduce the Autoregressive Universal Segmentation Model (AUSM), a single
architecture that unifies both prompted and unprompted video segmentation.
Built on recent state-space models, AUSM maintains a fixed-size spatial state
and scales to video streams of arbitrary length. Furthermore, all components of
AUSM are designed for parallel training across frames, yielding substantial
speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS
2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS), AUSM outperforms prior
universal streaming video segmentation methods and achieves up to 2.5x faster
training on 16-frame sequences.
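
The core formulation described above, streaming segmentation as autoregressive per-frame mask prediction over a fixed-size spatial state, can be illustrated with a minimal sketch. The names below (StreamingSegmenter, state_update, mask_head) are illustrative placeholders, not the authors' implementation; the convolutional state-update rule is a generic recurrent stand-in for the state-space layers the paper builds on, and the sequential loop shown here does not reflect AUSM's parallel training across frames.

```python
import torch
import torch.nn as nn

class StreamingSegmenter(nn.Module):
    """Minimal sketch of autoregressive, per-frame mask prediction with a
    fixed-size spatial state (not the actual AUSM architecture)."""

    def __init__(self, feat_dim=64, state_dim=64, num_objects=8):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # The state update mixes the previous state with current-frame features,
        # so memory stays constant regardless of how long the stream runs.
        self.state_update = nn.Conv2d(feat_dim + state_dim, state_dim,
                                      kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(state_dim, num_objects, kernel_size=1)
        self.state_dim = state_dim

    def forward(self, frames):
        # frames: (T, 3, H, W) video stream of arbitrary length T
        T, _, H, W = frames.shape
        state = frames.new_zeros(1, self.state_dim, H, W)  # fixed-size state
        mask_logits = []
        for t in range(T):  # streaming: one frame at a time
            feat = self.encoder(frames[t:t + 1])              # (1, feat_dim, H, W)
            state = torch.tanh(self.state_update(
                torch.cat([feat, state], dim=1)))             # recurrent update
            mask_logits.append(self.mask_head(state))         # per-object logits
        return torch.stack(mask_logits, dim=0)                # (T, 1, K, H, W)

# Usage: state (and thus memory) does not grow with stream length.
video = torch.randn(16, 3, 128, 128)
logits = StreamingSegmenter()(video)
print(logits.shape)  # torch.Size([16, 1, 8, 128, 128])
```

In this sketch, prompted segmentation would correspond to initializing or conditioning the state on provided object masks, while unprompted segmentation would start from an empty state and let the mask head detect all objects; the paper's contribution is a single architecture covering both cases.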