Autoregressive Universal Video Segmentation Model
August 26, 2025
Authors: Miran Heo, Sukjun Hwang, Min-Hung Chen, Yu-Chiang Frank Wang, Albert Gu, Seon Joo Kim, Ryo Hachiuma
cs.AI
Abstract
Recent video foundation models such as SAM2 excel at prompted video
segmentation by treating masks as a general-purpose primitive. However, many
real-world settings require unprompted segmentation that aims to detect and
track all objects in a video without external cues, leaving today's landscape
fragmented across task-specific models and pipelines. We recast streaming video
segmentation as sequential mask prediction, analogous to language modeling, and
introduce the Autoregressive Universal Segmentation Model (AUSM), a single
architecture that unifies both prompted and unprompted video segmentation.
Built on recent state-space models, AUSM maintains a fixed-size spatial state
and scales to video streams of arbitrary length. Furthermore, all components of
AUSM are designed for parallel training across frames, yielding substantial
speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS
2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS), AUSM outperforms prior
universal streaming video segmentation methods and achieves up to 2.5x faster
training on 16-frame sequences.
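
The core formulation described above, streaming segmentation as autoregressive per-frame mask prediction over a fixed-size spatial state, can be illustrated with a minimal sketch. The names below (StreamingSegmenter, state_update, mask_head) are illustrative placeholders, not the authors' implementation; the convolutional state-update rule is a generic recurrent stand-in for the state-space layers the paper builds on, and the sequential loop shown here does not reflect AUSM's parallel training across frames.

```python
import torch
import torch.nn as nn

class StreamingSegmenter(nn.Module):
    """Minimal sketch of autoregressive, per-frame mask prediction with a
    fixed-size spatial state (not the actual AUSM architecture)."""

    def __init__(self, feat_dim=64, state_dim=64, num_objects=8):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # The state update mixes the previous state with current-frame features,
        # so memory stays constant regardless of how long the stream runs.
        self.state_update = nn.Conv2d(feat_dim + state_dim, state_dim,
                                      kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(state_dim, num_objects, kernel_size=1)
        self.state_dim = state_dim

    def forward(self, frames):
        # frames: (T, 3, H, W) video stream of arbitrary length T
        T, _, H, W = frames.shape
        state = frames.new_zeros(1, self.state_dim, H, W)  # fixed-size state
        mask_logits = []
        for t in range(T):  # streaming: one frame at a time
            feat = self.encoder(frames[t:t + 1])              # (1, feat_dim, H, W)
            state = torch.tanh(self.state_update(
                torch.cat([feat, state], dim=1)))             # recurrent update
            mask_logits.append(self.mask_head(state))         # per-object logits
        return torch.stack(mask_logits, dim=0)                # (T, 1, K, H, W)

# Usage: state (and thus memory) does not grow with stream length.
video = torch.randn(16, 3, 128, 128)
logits = StreamingSegmenter()(video)
print(logits.shape)  # torch.Size([16, 1, 8, 128, 128])
```

In this sketch, prompted segmentation would correspond to initializing or conditioning the state on provided object masks, while unprompted segmentation would start from an empty state and let the mask head detect all objects; the paper's contribution is a single architecture covering both cases.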