SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Abstract

Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable recent advances, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, which neglects the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and use of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target from the frames it has already processed, enabling robust segmentation of subsequent frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational effort according to scene complexity. To rigorously assess VOS methods in scenarios that demand high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state of the art in concept-aware video object segmentation.