SAM4D: Segment Anything in Camera and LiDAR Streams

Abstract

We present SAM4D, a multi-modal, temporal foundation model designed for promptable segmentation across camera and LiDAR streams. We introduce Unified Multi-modal Positional Encoding (UMPE) to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. We also propose Motion-aware Cross-modal Memory Attention (MCMA), which uses ego-motion compensation to improve temporal consistency and long-horizon feature retrieval, supporting robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that combines VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg dataset, which demonstrate the strong cross-modal segmentation ability of the proposed SAM4D and its potential for data annotation.
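The abstract does not specify how MCMA performs ego-motion compensation. As a rough illustration of the general idea only, the sketch below re-expresses 3D positions stored in a past ego frame in the current ego frame, using 4x4 ego-to-world poses. All function names here (`make_pose`, `invert_rigid`, `compensate`) are hypothetical and are not taken from SAM4D.

```python
import math

def make_pose(yaw, tx, ty, tz):
    """Build a 4x4 ego-to-world pose from a yaw angle (radians) and a translation.
    Assumption: planar rotation only, to keep the example short."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def invert_rigid(T):
    """Invert a rigid transform: the inverse of [R | t] is [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def apply_transform(T, p):
    """Apply a 4x4 rigid transform to a 3D point given as (x, y, z)."""
    return tuple(sum(T[i][k] * p[k] for k in range(3)) + T[i][3] for i in range(3))

def compensate(points_past, pose_past, pose_now):
    """Map points from a past ego frame to the current ego frame.
    Path: past ego frame -> world frame -> current ego frame."""
    world_to_now = invert_rigid(pose_now)
    return [apply_transform(world_to_now, apply_transform(pose_past, p))
            for p in points_past]

# Hypothetical example: the vehicle drove 2 m forward (along x) between frames,
# so a static point 10 m ahead in the past frame is 8 m ahead now.
pose_past = make_pose(0.0, 0.0, 0.0, 0.0)
pose_now = make_pose(0.0, 2.0, 0.0, 0.0)
moved = compensate([(10.0, 0.0, 0.0)], pose_past, pose_now)
```

Under this assumption, positions attached to stored memory features stay consistent with the current frame even as the vehicle moves, which is the intuition behind compensating for ego motion before attending over past frames.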
