

VideoMaMa: Mask-Guided Video Matting via Generative Prior

January 20, 2026
Authors: Sangbeom Lim, Seoung Wug Oh, Jiahui Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee
cs.AI

Abstract

Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.
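The abstract describes a two-stage pipeline: coarse masks from a readily available segmenter are refined into soft alpha mattes, and those mattes are then reused as pseudo ground truth for training a matting model. The sketch below only illustrates that data flow under assumed placeholder names (segment_video, mask_to_matte, build_pseudo_labels are hypothetical stubs, not the paper's API); a real setup would substitute SAM2-style mask prediction and the VideoMaMa diffusion-based refiner for the stubs.

```python
# Schematic sketch of the mask-to-matte pseudo-labeling idea described above.
# All function names are hypothetical placeholders, not the paper's actual code.
import numpy as np

def segment_video(frames):
    # Stand-in for an off-the-shelf video segmenter (e.g. a SAM2-style model):
    # returns one coarse binary mask per frame.
    return [(f.mean(axis=-1) > 0.5).astype(np.float32) for f in frames]

def mask_to_matte(frames, masks):
    # Stand-in for VideoMaMa: refine each coarse mask into a soft alpha matte
    # in [0, 1]. A real model would condition a video diffusion prior on the
    # frames; here a crude neighbourhood average just illustrates soft output.
    softened = []
    for m in masks:
        neighbours = sum(np.roll(m, s, axis=ax) for ax in (0, 1) for s in (-1, 1))
        softened.append(np.clip((m + neighbours) / 5.0, 0.0, 1.0))
    return softened

def build_pseudo_labels(video_frames):
    masks = segment_video(video_frames)          # cheap, coarse segmentation cues
    mattes = mask_to_matte(video_frames, masks)  # soft alpha pseudo labels
    return list(zip(video_frames, mattes))       # (frame, matte) training pairs

# Toy usage: four random RGB frames standing in for a real-world clip.
frames = [np.random.rand(64, 64, 3).astype(np.float32) for _ in range(4)]
dataset = build_pseudo_labels(frames)
print(len(dataset), dataset[0][1].shape)  # -> 4 (64, 64)
```

In the paper's actual pipeline, pairs produced this way over 50K+ real-world videos form the MA-V dataset, which is then used to fine-tune SAM2 into SAM2-Matte.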