

The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

December 4, 2025
Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
cs.AI

Abstract

This paper investigates the fundamental discontinuity between the latest two Segment Anything Models, SAM2 and SAM3. We explain why expertise in the prompt-based segmentation of SAM2 does not transfer to the multimodal, concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) the Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting the spatial prompt semantics of SAM2 with the multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing the pure vision-temporal design of SAM2 versus SAM3's integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity handling via Mixture-of-Experts; (3) Dataset and Annotation Differences, contrasting the SA-V video masks of SAM2 with the multimodal concept-annotated corpora of SAM3; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.
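To make the prompt-based versus concept-driven contrast concrete, the sketch below pairs a standard SAM2 spatial-prompt call with a purely hypothetical concept-driven call. The SAM2 portion assumes the publicly released `sam2` Python package and its `SAM2ImagePredictor` interface; the concept-driven portion is illustrative pseudocode only, and the name `ConceptSegmenter` and its `segment` method are invented for this example rather than taken from the actual SAM3 API.

```python
# Sketch: spatial-prompt segmentation (SAM2-style) vs. a hypothetical
# concept-driven call. Assumes the released `sam2` package; the second
# half is illustrative pseudocode, NOT the real SAM3 interface.
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

image = np.array(Image.open("street_scene.jpg").convert("RGB"))

# --- SAM2: geometric prompting -------------------------------------------
# One positive click yields a mask for whatever object sits under the point;
# the model has no notion of *what* the object is, only *where* it was clicked.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[620, 410]]),   # (x, y) pixel location of the click
    point_labels=np.array([1]),            # 1 = foreground click
    multimask_output=True,                 # return several candidate masks
)
best_mask = masks[int(np.argmax(scores))]

# --- Concept-driven prompting (hypothetical interface) --------------------
# In the concept-driven paradigm described above, a single noun phrase (plus
# optional image exemplars) is expected to return masks for *every* matching
# instance, with no clicks at all. `ConceptSegmenter` is an invented name used
# purely to illustrate the shape of such an interface.
#
# segmenter = ConceptSegmenter.from_pretrained("<sam3-checkpoint>")
# instance_masks = segmenter.segment(
#     image,
#     text="yellow school bus",           # open-vocabulary concept prompt
#     exemplars=[image[100:220, 80:260]], # optional visual exemplars of the concept
# )
```

The relevant difference is the contract rather than the syntax: the spatial call is agnostic to object identity and returns one locally prompted mask, whereas a concept-driven call must resolve what the phrase denotes and enumerate all matching instances, which is why prompting expertise built around SAM2 does not carry over directly.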