3AM: Segment Anything with Geometric Consistency in Videos
January 13, 2026
Authors: Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu
cs.AI
Abstract
Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to their reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondences. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy that ensures sampled frames observe spatially consistent object regions, enabling reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and its extensions, achieving 90.6% IoU and 71.7% Positive IoU on the ScanNet++ Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points, respectively. Project page: https://jayisaking.github.io/3AM-Page/
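The abstract does not include code, so the sketch below is only a hypothetical illustration of what a "lightweight Feature Merger" fusing multi-level 3D-aware features with appearance features might look like. All class names, feature dimensions, and layer choices are assumptions for the sake of the example, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureMerger(nn.Module):
    """Hypothetical sketch: project multi-level 3D-aware features
    (e.g., from a MUSt3R-style backbone) to a common width, align them
    spatially, and fuse them with appearance features (e.g., from SAM2).
    Dimensions below are assumptions, not the paper's values."""

    def __init__(self, geo_dims=(384, 768), app_dim=256):
        super().__init__()
        # One 1x1 projection per 3D-aware feature level.
        self.proj = nn.ModuleList(
            [nn.Conv2d(d, app_dim, kernel_size=1) for d in geo_dims]
        )
        # Fuse concatenated geometric + appearance features back to app_dim.
        self.fuse = nn.Sequential(
            nn.Conv2d(app_dim * (len(geo_dims) + 1), app_dim,
                      kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(app_dim, app_dim, kernel_size=1),
        )

    def forward(self, app_feat, geo_feats):
        # app_feat:  (B, app_dim, H, W) appearance features.
        # geo_feats: list of (B, d_i, H_i, W_i) multi-level 3D-aware features.
        h, w = app_feat.shape[-2:]
        aligned = [
            F.interpolate(p(g), size=(h, w), mode="bilinear",
                          align_corners=False)
            for p, g in zip(self.proj, geo_feats)
        ]
        return self.fuse(torch.cat([app_feat, *aligned], dim=1))
```

Keeping the output at the appearance-feature width (here `app_dim`) would let such a module drop into an existing pipeline without changing downstream shapes, which is one plausible reading of "lightweight" here.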
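The abstract likewise does not specify the field-of-view aware sampling criterion. As a loose illustration, a greedy heuristic over camera centers and viewing directions (assumed available at training time, per the training-time setting described above) might look like the following; the function name, thresholds, and greedy scheme are all hypothetical.

```python
import numpy as np


def fov_aware_sampling(centers, dirs, num_frames=8,
                       min_cos=0.5, max_dist=2.0, seed=0):
    """Hypothetical sketch: greedily pick frames whose viewing directions
    and camera centers stay close to a random anchor frame, so the picked
    frames plausibly observe the same object region.

    centers: (N, 3) camera centers; dirs: (N, 3) unit viewing directions.
    May return fewer than num_frames if too few frames qualify."""
    rng = np.random.default_rng(seed)
    n = len(centers)
    anchor = int(rng.integers(n))
    picked = [anchor]
    for i in rng.permutation(n):
        if len(picked) == num_frames:
            break
        if int(i) in picked:
            continue
        # Keep frame i only if it likely shares field of view with the
        # anchor: similar viewing direction and a nearby camera center.
        cos_sim = float(np.dot(dirs[i], dirs[anchor]))
        dist = float(np.linalg.norm(centers[i] - centers[anchor]))
        if cos_sim >= min_cos and dist <= max_dist:
            picked.append(int(i))
    return picked
```

The point of such a filter is that frames with overlapping fields of view provide the cross-view correspondences needed to supervise geometry-consistent recognition; frames that never see the same region contribute no usable 3D correspondence signal.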