3AM: Segment Anything with Geometric Consistency in Videos

January 13, 2026
Authors: Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
cs.AI

Abstract

Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to their reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view-aware sampling strategy that ensures frames observe spatially consistent object regions, enabling reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and its extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset and improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/
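
The abstract names a lightweight Feature Merger that fuses multi-level MUSt3R features with SAM2's appearance features, but this page gives no implementation details. The PyTorch sketch below is therefore only a guess at what such a merger could look like: the layer widths, the per-level 1x1 projections, and the project-resize-concatenate-convolve design are assumptions, not the authors' published architecture.

```python
# Minimal sketch of a feature merger, assuming a
# project-resize-concatenate design. All names, dimensions,
# and layer choices below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMerger(nn.Module):
    def __init__(self, geo_dims=(256, 512, 1024), app_dim=256, out_dim=256):
        super().__init__()
        # One 1x1 projection per (assumed) MUSt3R feature level.
        self.proj = nn.ModuleList(
            nn.Conv2d(d, out_dim, kernel_size=1) for d in geo_dims
        )
        # Fuse the concatenated appearance + geometric features.
        self.fuse = nn.Sequential(
            nn.Conv2d(app_dim + out_dim, out_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, app_feat, geo_feats):
        # app_feat:  (B, app_dim, H, W) SAM2-style appearance features.
        # geo_feats: list of (B, C_i, H_i, W_i) multi-level 3D-aware features.
        size = app_feat.shape[-2:]
        # Project each level to a common width, resize to the
        # appearance resolution, and sum across levels.
        merged = sum(
            F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, geo_feats)
        )
        return self.fuse(torch.cat([app_feat, merged], dim=1))

# Example shapes only; real feature maps would come from the two backbones.
app = torch.randn(1, 256, 64, 64)
geo = [torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16)]]
out = FeatureMerger()(app, geo)  # -> (1, 256, 64, 64)
```

A summed multi-level fusion like this is just one plausible reading of "fuses multi-level MUSt3R features"; attention-based or per-level concatenation schemes would fit the abstract's description equally well.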