

SAM4D: Segment Anything in Camera and LiDAR Streams

June 26, 2025
Authors: Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li
cs.AI

Abstract

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg dataset, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
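
The two mechanisms named in the abstract can be pictured concretely: UMPE places camera and LiDAR features in one shared 3D positional space, and the motion-aware step in MCMA warps remembered 3D positions into the current ego frame before attention. The sketch below illustrates these ideas only; the function names, tensor shapes, sinusoidal formulation, depth source, and pose matrices are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a shared-3D-space positional encoding (the UMPE idea) and
# ego-motion compensation of memory positions (the motion-aware step in MCMA).
# All names, shapes, and formulas here are illustrative assumptions.
import torch


def sinusoidal_3d_encoding(xyz: torch.Tensor, num_feats: int = 64) -> torch.Tensor:
    """Encode 3D points (N, 3) into (N, 3 * num_feats) sinusoidal features."""
    freqs = torch.arange(num_feats // 2, dtype=torch.float32, device=xyz.device)
    freqs = 10000.0 ** (2 * freqs / num_feats)              # (num_feats/2,)
    scaled = xyz.unsqueeze(-1) / freqs                      # (N, 3, num_feats/2)
    enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)   # (N, 3, num_feats)
    return enc.flatten(1)                                   # (N, 3 * num_feats)


def encode_lidar(points_xyz: torch.Tensor) -> torch.Tensor:
    """LiDAR points already live in the ego 3D frame; encode them directly."""
    return sinusoidal_3d_encoding(points_xyz)


def encode_camera(pixels_uv: torch.Tensor, depth: torch.Tensor,
                  K: torch.Tensor, cam_to_ego: torch.Tensor) -> torch.Tensor:
    """Lift pixels (N, 2) to 3D using a depth estimate and camera geometry,
    then reuse the same encoding so both modalities share one positional space."""
    ones = torch.ones_like(pixels_uv[:, :1])
    uv1 = torch.cat([pixels_uv, ones], dim=1)               # homogeneous pixels (N, 3)
    rays = (torch.linalg.inv(K) @ uv1.T).T                  # camera-frame rays
    xyz_cam = rays * depth.unsqueeze(1)                     # scale rays by depth
    xyz_h = torch.cat([xyz_cam, ones], dim=1)               # homogeneous 3D points
    xyz_ego = (cam_to_ego @ xyz_h.T).T[:, :3]               # transform into ego frame
    return sinusoidal_3d_encoding(xyz_ego)


def compensate_ego_motion(mem_xyz: torch.Tensor,
                          pose_mem_to_now: torch.Tensor) -> torch.Tensor:
    """Warp 3D positions stored with past memory features into the current ego
    frame, so cross-frame memory attention compares consistent coordinates."""
    ones = torch.ones_like(mem_xyz[:, :1])
    xyz_h = torch.cat([mem_xyz, ones], dim=1)
    return (pose_mem_to_now @ xyz_h.T).T[:, :3]
```

In this reading, cross-modal prompting follows naturally: a click on an image pixel and a click on a LiDAR point both resolve to encodings in the same 3D space, so either can prompt segmentation in the other modality.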