StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving

March 10, 2026
Authors: Yuan Gao, Dengyuan Hua, Mattia Piccinini, Finn Rasmus Schäfer, Korbinian Moller, Lin Li, Johannes Betz
cs.AI

Abstract

Vision Language Models (VLMs) bridge visual perception and linguistic reasoning. In Autonomous Driving (AD), this synergy has enabled Vision Language Action (VLA) models, which translate high-level multimodal understanding into driving behaviors, typically represented as future trajectories. However, existing VLA models mainly generate generic collision-free trajectories. Beyond collision avoidance, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized driving. Moreover, many methods treat trajectory generation as naive token prediction, which can produce kinematically infeasible actions. To address these limitations, we present StyleVLA, a physics-informed VLA framework for generating diverse and physically plausible driving behaviors. We introduce a hybrid loss that combines a kinematic consistency constraint with a continuous regression head to improve trajectory feasibility. To train StyleVLA, built on Qwen3-VL-4B, we construct a large-scale instruction dataset with over 1.2k scenarios, 76k Bird's Eye View (BEV) samples, and 42k First Person View (FPV) samples, with ground-truth trajectories for five driving styles and natural-language instructions. Experiments show that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA models. Using a composite driving score measuring success rate, physical feasibility, and style adherence, StyleVLA achieves 0.55 on BEV and 0.51 on FPV, versus 0.32 and 0.35 for Gemini-3-Pro. These results show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.
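The abstract describes the hybrid loss only at a high level. Below is a minimal PyTorch sketch of one plausible form, assuming the continuous regression head outputs 2D waypoints and the kinematic consistency term penalizes accelerations beyond a feasibility bound; the function name, the hinge-style penalty, and the limits `dt`, `max_accel`, and `lambda_kin` are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch of a hybrid trajectory loss: continuous L2 regression
# on waypoints plus a kinematic consistency penalty. The exact StyleVLA
# formulation is not given in the abstract; limits and weights are assumed.
import torch

def hybrid_trajectory_loss(pred, target, dt=0.5,
                           max_accel=4.0, lambda_kin=0.1):
    """pred, target: (B, T, 2) waypoints in meters; dt: timestep in seconds."""
    # Continuous regression term: direct L2 error on predicted waypoints,
    # avoiding the quantization error of token-based trajectory decoding.
    reg_loss = torch.mean((pred - target) ** 2)

    # Finite-difference velocities and accelerations from the waypoints.
    vel = (pred[:, 1:] - pred[:, :-1]) / dt   # (B, T-1, 2)
    acc = (vel[:, 1:] - vel[:, :-1]) / dt     # (B, T-2, 2)

    # Kinematic consistency: hinge penalty on acceleration magnitudes that
    # exceed the feasibility bound (zero when the trajectory is within limits).
    acc_norm = torch.linalg.norm(acc, dim=-1)
    kin_loss = torch.mean(torch.relu(acc_norm - max_accel) ** 2)

    return reg_loss + lambda_kin * kin_loss
```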
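Likewise, the composite driving score is only named by its three components. A simple sketch, assuming equal-weight averaging of normalized component scores (the paper's actual weighting and component definitions are not stated here):

```python
# Illustrative composite driving score over the three criteria named in the
# abstract. An unweighted mean of normalized components is an assumption.
def driving_score(success_rate: float,
                  physical_feasibility: float,
                  style_adherence: float) -> float:
    """Each input is a normalized score in [0, 1]; returns their mean."""
    return (success_rate + physical_feasibility + style_adherence) / 3.0
```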