Autoregressive Universal Video Segmentation Model

Abstract

Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS), AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.

