SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Abstract

Video Object Segmentation (VOS) is a core task in computer vision that requires models to track and segment target objects across video frames. Despite notable recent advances, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, which neglects the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target from the frames already processed, enabling robust segmentation of subsequent frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational effort based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.
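The abstract does not say how scene complexity is measured or how the balance between LVLM reasoning and feature matching is decided. Purely as an illustration of the general idea, the sketch below routes each frame either to a cheap "matching" path or to an expensive "concept" path according to how much the frame differs from its predecessor. The names `scene_change_score` and `route_frames`, the mean-absolute-difference score, and the threshold of 0.3 are all hypothetical choices made for this example and are not taken from SeC.

```python
from typing import List, Sequence

# Hypothetical frame format: a grayscale frame flattened to intensities in [0, 1].
Frame = Sequence[float]


def scene_change_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute intensity difference between two same-sized frames."""
    if len(prev) != len(curr) or len(prev) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev)


def route_frames(frames: List[Frame], threshold: float = 0.3) -> List[str]:
    """Label each frame 'concept' (expensive path) or 'matching' (cheap path).

    Assumption for this sketch: a large change from the previous frame stands
    in for a scene transition that feature matching alone may not handle.
    """
    routes: List[str] = []
    for i, frame in enumerate(frames):
        if i == 0:
            # The first frame has no predecessor to compare against.
            routes.append("matching")
            continue
        score = scene_change_score(frames[i - 1], frame)
        routes.append("concept" if score > threshold else "matching")
    return routes


if __name__ == "__main__":
    # Toy 4-pixel frames: an abrupt brightness jump occurs at the third frame.
    toy = [[0.1] * 4, [0.12] * 4, [0.9] * 4, [0.88] * 4]
    print(route_frames(toy))
```

In this toy setup only the frame that follows the abrupt change is sent down the expensive path, which mirrors the stated goal of spending more computation only where the scene demands it. A real system would presumably use a far richer complexity signal than raw pixel differences.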