SAM4D: Segment Anything in Camera and LiDAR Streams

June 26, 2025
Authors: Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li
cs.AI

Abstract

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. We introduce Unified Multi-modal Positional Encoding (UMPE) to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg dataset, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
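
The abstract does not spell out how UMPE is computed, but the core idea it states is that camera and LiDAR features are positionally encoded in one shared 3D space. The following is a minimal sketch of what such a shared encoding could look like, assuming a standard sinusoidal encoding over 3D coordinates and assuming camera features have already been lifted to 3D points (the lifting step is not shown); all tensor shapes and names here are hypothetical, not taken from the paper.

```python
import torch

def sine_pe_3d(xyz: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Sinusoidal encoding of 3D coordinates, shared by both modalities.

    xyz: (..., 3) points expressed in a common ego/world frame.
    returns: (..., 3 * 2 * num_freqs) positional encoding.
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype, device=xyz.device)  # (F,)
    scaled = xyz.unsqueeze(-1) * freqs                       # (..., 3, F)
    enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)    # (..., 3, 2F)
    return enc.flatten(-2)                                   # (..., 6F)

# LiDAR tokens: encode raw point coordinates directly.
lidar_xyz = torch.randn(4096, 3)          # hypothetical point cloud
lidar_pe = sine_pe_3d(lidar_xyz)

# Camera tokens: assume pixel features were lifted to 3D (e.g. via depth and
# camera pose), then reuse the *same* encoding so both modalities share one
# positional space and can prompt each other.
cam_xyz = torch.randn(64 * 176, 3)        # hypothetical lifted pixel centers
cam_pe = sine_pe_3d(cam_xyz)
```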
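Likewise, the abstract only states that MCMA uses ego-motion compensation before memory attention. A minimal sketch of that idea, under the assumption that 3D positions stored with each memory entry are warped into the current ego frame by the relative vehicle pose before cross-attention: everything below (shapes, the use of `MultiheadAttention`, variable names) is an illustrative assumption rather than the paper's actual module.

```python
import torch

def ego_compensate(points_prev: torch.Tensor, T_prev_to_curr: torch.Tensor) -> torch.Tensor:
    """Warp 3D positions stored with a past memory entry into the current ego frame.

    points_prev: (N, 3) coordinates in the previous ego frame.
    T_prev_to_curr: (4, 4) homogeneous relative ego pose.
    """
    homo = torch.cat([points_prev, torch.ones_like(points_prev[:, :1])], dim=-1)  # (N, 4)
    return (homo @ T_prev_to_curr.T)[:, :3]                                       # (N, 3)

# Hypothetical memory attention: current-frame tokens attend to memory tokens
# whose positions have been ego-compensated, so static scene structure stays
# aligned over time and long-horizon retrieval is not corrupted by ego motion.
embed_dim = 256
attn = torch.nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

curr_tokens = torch.randn(1, 4096, embed_dim)   # current-frame features (queries)
mem_tokens = torch.randn(1, 4096, embed_dim)    # one stored memory frame (keys/values)
mem_xyz = torch.randn(4096, 3)                  # 3D positions stored with the memory
T_rel = torch.eye(4)                            # relative ego pose (identity placeholder)

mem_xyz_curr = ego_compensate(mem_xyz, T_rel)   # would feed the shared 3D positional encoding
fused, _ = attn(curr_tokens, mem_tokens, mem_tokens)
```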