Temporal Prompting Matters: Rethinking Referring Video Object Segmentation
October 8, 2025
Authors: Ci-Siang Lin, Min-Hung Chen, I-Jieh Liu, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang
cs.AI
Abstract
Referring Video Object Segmentation (RVOS) aims to segment the object
referred to by the query sentence in the video. Most existing methods require
end-to-end training with dense mask annotations, which can be
computationally expensive and scales poorly. In this work, we rethink the RVOS
problem and aim to investigate the key to this task. Based on existing
foundation segmentation models, we decompose the RVOS task into referring,
video, and segmentation factors, and propose a Temporal Prompt Generation and
Selection (Tenet) framework to address the referring and video factors while
leaving the segmentation problem to foundation models. To efficiently adapt
image-based foundation segmentation models to referring video object
segmentation, we leverage off-the-shelf object detectors and trackers to
produce temporal prompts associated with the referring sentence. While
high-quality temporal prompts can be produced, they cannot be easily
identified from confidence scores alone. To tackle this issue, we propose Prompt
Preference Learning to evaluate the quality of the produced temporal prompts.
By using such prompts to instruct image-based foundation segmentation models,
we can produce high-quality masks for the referred object,
enabling efficient model adaptation to referring video object segmentation.
Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet
framework.
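The pipeline described in the abstract — generate candidate temporal prompts with off-the-shelf detectors and trackers, rank them with a learned preference scorer rather than raw confidence, then hand the winner to a promptable foundation segmentation model — can be sketched as follows. All function names, data shapes, and the toy scoring heuristic are illustrative assumptions, not the authors' actual API or the Prompt Preference Learning model.

```python
# A minimal, hypothetical sketch of the Tenet pipeline. Every component here
# is a stub standing in for a real model (detector/tracker, preference
# learner, foundation segmenter such as SAM); names and shapes are assumed.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

@dataclass
class TemporalPrompt:
    """A track of per-frame boxes for one candidate object."""
    boxes: List[Box]
    confidence: float  # detector/tracker score (unreliable for ranking)

def generate_temporal_prompts(num_frames: int) -> List[TemporalPrompt]:
    """Stand-in for an off-the-shelf detector + tracker producing candidate
    tracks associated with the referring sentence."""
    return [
        TemporalPrompt(boxes=[(10, 10, 50, 50)] * num_frames, confidence=0.9),
        TemporalPrompt(boxes=[(30, 30, 80, 80)] * num_frames, confidence=0.7),
    ]

def preference_score(prompt: TemporalPrompt) -> float:
    """Stand-in for Prompt Preference Learning: a learned scorer that ranks
    prompt quality better than raw confidence. Here, a toy heuristic that
    prefers larger average box area."""
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in prompt.boxes]
    return sum(areas) / len(areas)

def segment_with_prompt(prompt: TemporalPrompt) -> List[Box]:
    """Stand-in for an image-based foundation segmentation model prompted
    frame by frame; returns one 'mask' (here, just the box) per frame."""
    return list(prompt.boxes)

def tenet_pipeline(num_frames: int) -> List[Box]:
    candidates = generate_temporal_prompts(num_frames)
    # Select by the learned preference score, not raw detector confidence:
    # note the higher-confidence candidate is NOT the one chosen here.
    best = max(candidates, key=preference_score)
    return segment_with_prompt(best)

masks = tenet_pipeline(num_frames=4)  # one mask per frame
```

The key design point the sketch illustrates is the selection step: detector confidence is kept in the prompt but deliberately ignored in favor of the preference score, mirroring the paper's observation that good temporal prompts are hard to identify from confidence alone.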