

Vega: Learning to Drive with Natural Language Instructions

March 26, 2026
Authors: Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
cs.AI

Abstract

Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines only use the language modality for scene description or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset, InstructScene, containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language), and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for each modality to enhance model capability. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
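
The abstract describes joint attention across modalities combined with individual projection layers per modality, a design reminiscent of MM-DiT-style multimodal blocks. The following is a minimal PyTorch sketch of that idea only, assuming four token streams (vision, language, world, action); the class, method, and parameter names are hypothetical illustrations, not the paper's actual implementation, and details such as causal masking for the autoregressive streams are omitted:

# Hypothetical sketch of joint attention with per-modality projections.
# Names and structure are assumptions; the paper's architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModalityAttention(nn.Module):
    """Joint attention over concatenated modality tokens, with
    individual QKV and output projections for each modality."""

    def __init__(self, dim: int, num_heads: int,
                 modalities=("vision", "language", "world", "action")):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Separate projection layers per modality (as the abstract
        # describes), rather than one shared set of weights.
        self.qkv = nn.ModuleDict({m: nn.Linear(dim, 3 * dim) for m in modalities})
        self.out = nn.ModuleDict({m: nn.Linear(dim, dim) for m in modalities})

    def forward(self, tokens: dict) -> dict:
        # tokens: modality name -> tensor of shape (batch, seq_len, dim)
        qs, ks, vs, lengths = [], [], [], []
        for name, x in tokens.items():
            q, k, v = self.qkv[name](x).chunk(3, dim=-1)
            qs.append(q); ks.append(k); vs.append(v)
            lengths.append(x.shape[1])

        def split_heads(t):
            b, n, _ = t.shape
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # Concatenate all modalities along the sequence axis so every
        # token can attend to every other token (joint attention).
        q = split_heads(torch.cat(qs, dim=1))
        k = split_heads(torch.cat(ks, dim=1))
        v = split_heads(torch.cat(vs, dim=1))
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).flatten(2)  # (batch, total_len, dim)

        # Split back per modality and apply modality-specific output heads.
        outputs, offset = {}, 0
        for name, n in zip(tokens, lengths):
            outputs[name] = self.out[name](attn[:, offset:offset + n])
            offset += n
        return outputs

Under this reading, the shared attention operation fuses information across modalities while the per-modality projections let each token stream keep its own input/output statistics, which is one plausible way to reconcile autoregressive (vision, language) and diffusion (world, action) streams inside one model.
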