
Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

December 30, 2025
Authors: Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi
cs.AI

Abstract

The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms, ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.
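To make the kind of cross-modal pre-training paradigm surveyed here concrete, the sketch below aligns camera and LiDAR embeddings in a shared space with a symmetric contrastive (InfoNCE-style) objective, one common approach to learning a unified representation from paired sensor data. This is a minimal illustrative example, not the paper's method: the encoders and names (`CameraEncoder`, `LiDAREncoder`, `contrastive_loss`) are toy stand-ins chosen for brevity.

```python
# Illustrative sketch of cross-modal contrastive pre-training on paired
# camera/LiDAR data. Encoders are deliberately tiny stand-ins; real systems
# would use image backbones and voxel/point-cloud networks instead.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraEncoder(nn.Module):
    """Toy image encoder: a small CNN producing one global embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, img):  # img: (B, 3, H, W)
        return self.net(img)

class LiDAREncoder(nn.Module):
    """Toy point-cloud encoder: per-point MLP + max pooling (PointNet-style)."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, dim),
        )

    def forward(self, pts):  # pts: (B, N, 3) point coordinates
        return self.mlp(pts).max(dim=1).values  # pool over points -> (B, dim)

def contrastive_loss(z_img, z_pts, temperature=0.07):
    """Symmetric InfoNCE: matching camera/LiDAR pairs attract, others repel."""
    z_img = F.normalize(z_img, dim=-1)
    z_pts = F.normalize(z_pts, dim=-1)
    logits = z_img @ z_pts.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: one pre-training step on a dummy batch of paired sensor frames.
cam_enc, lidar_enc = CameraEncoder(), LiDAREncoder()
imgs = torch.randn(4, 3, 128, 128)  # 4 camera frames
points = torch.randn(4, 2048, 3)    # 4 LiDAR sweeps, 2048 points each
loss = contrastive_loss(cam_enc(imgs), lidar_enc(points))
loss.backward()
```

After pre-training, the aligned encoders would typically be fine-tuned on downstream tasks such as 3D object detection or semantic occupancy prediction, where dense (per-pixel/per-voxel) rather than global embeddings are usually aligned.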