

Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

October 8, 2025
Authors: Ci-Siang Lin, Min-Hung Chen, I-Jieh Liu, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang
cs.AI

Abstract

Referring Video Object Segmentation (RVOS) aims to segment the object referred to by a query sentence in a video. Most existing methods require end-to-end training with dense mask annotations, which is computationally expensive and limits scalability. In this work, we rethink the RVOS problem and investigate what is key to this task. Building on existing foundation segmentation models, we decompose RVOS into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework that addresses the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. Although high-quality temporal prompts can be produced, they cannot be easily identified from confidence scores alone. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By using the selected prompts to instruct image-based foundation segmentation models, we can produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.
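
The abstract describes the Tenet pipeline only at a high level. The sketch below illustrates one way such a detect-track-select-segment pipeline could be wired together; all class and function names (ReferringDetector, Tracker, PromptPreferenceScorer, FoundationSegmenter, segment_referred_object) are hypothetical placeholders for illustration and do not correspond to the paper's actual implementation.

```python
# A minimal sketch of the pipeline described above, not the authors' code.
# The injected components stand in for what the abstract mentions: a
# language-conditioned detector, an off-the-shelf tracker, the Prompt
# Preference Learning module, and an image-based foundation segmentation
# model (e.g., a SAM-style promptable segmenter).

from dataclasses import dataclass


@dataclass
class BoxPrompt:
    """One element of a temporal prompt: a box for a single frame."""
    frame_idx: int
    box: tuple            # (x1, y1, x2, y2) in pixel coordinates
    confidence: float     # detector/tracker score (unreliable for ranking)


def segment_referred_object(frames, sentence, detector, tracker, scorer, segmenter) -> list:
    """Segment the object referred to by `sentence` across `frames`.

    Steps mirror the abstract:
      1. A referring detector proposes boxes matching the sentence.
      2. An off-the-shelf tracker propagates each proposal through the clip,
         producing candidate temporal prompts (one BoxPrompt per frame).
      3. Prompt Preference Learning ranks the candidate prompt tracks, since
         raw confidence scores do not reliably identify the best one.
      4. The selected temporal prompt instructs the image-based foundation
         segmentation model frame by frame to yield the final masks.
    """
    # 1) Language-conditioned detections on a key frame
    seed_boxes = detector.detect(frames[0], sentence)

    # 2) Propagate every seed box through the video to form candidate tracks
    candidate_tracks = [tracker.track(frames, seed) for seed in seed_boxes]

    # 3) Pick the temporal prompt the preference model ranks highest
    best_track = max(
        candidate_tracks,
        key=lambda track: scorer.score(frames, sentence, track),
    )

    # 4) Prompt the foundation segmentation model with the selected boxes
    return [
        segmenter.segment(frame, prompt.box)
        for frame, prompt in zip(frames, best_track)
    ]
```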