GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving
March 19, 2025
Authors: William Ljungbergh, Adam Lilja, Adam Tonderski, Arvid Laveno Ling, Carl Lindström, Willem Verbeke, Junsheng Fu, Christoffer Petersson, Lars Hammarstrand, Michael Felsberg
cs.AI
Abstract
Self-supervised pre-training based on next-token prediction has enabled large
language models to capture the underlying structure of text, and has led to
unprecedented performance on a large array of tasks when applied at scale.
Similarly, autonomous driving generates vast amounts of spatiotemporal data,
alluding to the possibility of harnessing scale to learn the underlying
geometric and semantic structure of the environment and its evolution over
time. In this direction, we propose a geometric and semantic self-supervised
pre-training method, GASP, that learns a unified representation by predicting,
at any queried future point in spacetime, (1) general occupancy, capturing the
evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle
path through the environment; and (3) distilled high-level features from a
vision foundation model. By modeling geometric and semantic 4D occupancy fields
instead of raw sensor measurements, the model learns a structured,
generalizable representation of the environment and its evolution through time.
We validate GASP on multiple autonomous driving benchmarks, demonstrating
significant improvements in semantic occupancy forecasting, online mapping, and
ego trajectory prediction. Our results demonstrate that continuous 4D geometric
and semantic occupancy prediction provides a scalable and effective
pre-training paradigm for autonomous driving. For code and additional
visualizations, see https://research.zenseact.com/publications/gasp/.
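
The three prediction targets listed in the abstract can be pictured as three output heads evaluated at a queried point (x, y, z, t). The sketch below is only a toy illustration under stated assumptions, not the authors' implementation: the abstract does not describe the architecture or the losses, so the linear decoder `ToyField`, the binary cross-entropy and mean-squared-error terms, and their equal weighting are all hypothetical choices made to keep the example small.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Query:
    """A queried future point in spacetime (names are illustrative)."""
    x: float
    y: float
    z: float
    t: float  # future time offset


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


class ToyField:
    """Hypothetical linear decoder: (scene embedding, query) -> three outputs.

    Stands in for whatever network produces the 4D occupancy field;
    the real model is not specified in the abstract.
    """

    def __init__(self, embed_dim: int, feat_dim: int, seed: int = 0):
        rng = random.Random(seed)
        in_dim = embed_dim + 4  # scene embedding plus (x, y, z, t)

        def row():
            return [rng.gauss(0.0, 0.1) for _ in range(in_dim)]

        self.w_occ = row()                              # general occupancy head
        self.w_ego = row()                              # ego occupancy head
        self.w_feat = [row() for _ in range(feat_dim)]  # feature head

    def __call__(self, scene, q: Query):
        h = list(scene) + [q.x, q.y, q.z, q.t]

        def dot(w):
            return sum(a * b for a, b in zip(w, h))

        occupancy = sigmoid(dot(self.w_occ))
        ego_occupancy = sigmoid(dot(self.w_ego))
        features = [dot(w) for w in self.w_feat]
        return occupancy, ego_occupancy, features


def bce(p: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one probability (assumed loss choice)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def pretraining_loss(field, scene, q, occ_target, ego_target, feat_target):
    """Sum of the three terms with equal weights (an assumption)."""
    occ, ego, feat = field(scene, q)
    feat_loss = sum((a - b) ** 2 for a, b in zip(feat, feat_target)) / len(feat_target)
    return bce(occ, occ_target) + bce(ego, ego_target) + feat_loss


field = ToyField(embed_dim=8, feat_dim=4)
scene = [0.0] * 8                        # placeholder scene embedding
query = Query(x=1.0, y=2.0, z=0.5, t=1.0)
loss = pretraining_loss(field, scene, query, 1.0, 0.0, [0.0] * 4)
```

In this toy setup the occupancy and ego-occupancy targets would come from future sensor data and the recorded ego path, and the feature target from a frozen vision foundation model; how the paper actually derives them is not stated in the abstract.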