
Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

October 7, 2025
Authors: Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, Jingdong Wang
cs.AI

Abstract

Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video, guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and to keep segmenting those pixels through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic "locate-then-segment" pipeline. However, this cascaded design creates an information bottleneck by reducing semantics to coarse geometric prompts (e.g., points), and it struggles to maintain temporal consistency because the segmentation process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained text-to-video (T2V) models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of the conventional approaches of generating a mask from noise or predicting the mask directly, we reformulate the task as learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks. Specifically, it achieves a J&F of 51.1 on MeViS (+1.6 over the prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
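
To make the idea of a "language-guided deformation" from a video representation to a mask more concrete, the following is a minimal sketch of a generic flow matching objective with a straight-line path between the two endpoints. It is an illustration under stated assumptions, not the FlowRVS implementation: the abstract does not specify the path, the loss, or the sampler, and every name here (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_integrate`, `oracle_velocity`, the toy latents) is hypothetical. Latents are plain Python lists standing in for real video and mask tensors.

```python
# Toy flow matching sketch. Assumption: a straight-line path between a
# video latent x0 and a mask latent x1; the paper may use a different path.

def interpolate(x0, x1, t):
    """Point at time t on the straight line from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, text, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t, text)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_integrate(velocity_fn, x0, text, steps=10):
    """Deform x0 by integrating dx/dt = v(x, t, text) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, text)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

# Hypothetical toy data: three-element "latents" instead of real tensors.
video_latent = [0.2, -1.0, 0.5]
mask_latent = [1.0, 0.0, 1.0]

def oracle_velocity(x, t, text):
    """Stand-in for a trained, text-conditioned network: returns the true velocity."""
    return target_velocity(video_latent, mask_latent)

loss = flow_matching_loss(oracle_velocity, video_latent, mask_latent, "the dog on the left", 0.3)
deformed = euler_integrate(oracle_velocity, video_latent, "the dog on the left")
```

With the oracle velocity, the loss is zero and the integrated result lands on the mask latent up to floating-point error. In a real system the oracle would be replaced by a learned network conditioned on the text, which is where a pretrained T2V backbone would come in.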