StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving
March 10, 2026
Authors: Yuan Gao, Dengyuan Hua, Mattia Piccinini, Finn Rasmus Schäfer, Korbinian Moller, Lin Li, Johannes Betz
cs.AI
Abstract
Vision Language Models (VLMs) bridge visual perception and linguistic reasoning. In Autonomous Driving (AD), this synergy has enabled Vision Language Action (VLA) models, which translate high-level multimodal understanding into driving behaviors, typically represented as future trajectories. However, existing VLA models mainly generate generic collision-free trajectories. Beyond collision avoidance, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized driving. Moreover, many methods treat trajectory generation as naive token prediction, which can produce kinematically infeasible actions. To address these limitations, we present StyleVLA, a physics-informed VLA framework for generating diverse and physically plausible driving behaviors. We introduce a hybrid loss that combines a kinematic consistency constraint with a continuous regression head to improve trajectory feasibility. To train StyleVLA, which is built on Qwen3-VL-4B, we construct a large-scale instruction dataset with over 1.2k scenarios, 76k Bird's Eye View (BEV) samples, and 42k First Person View (FPV) samples, with ground-truth trajectories for five driving styles and natural-language instructions. Experiments show that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA models. Using a composite driving score measuring success rate, physical feasibility, and style adherence, StyleVLA achieves 0.55 on BEV and 0.51 on FPV, versus 0.32 and 0.35 for Gemini-3-Pro. These results show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.
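To make the hybrid loss concrete, the following is a minimal, hedged sketch of how a continuous waypoint-regression term might be combined with a kinematic consistency penalty. The abstract does not specify the exact formulation, so every detail here (the MSE regression term, the finite-difference acceleration penalty, and the parameters `dt`, `a_max`, `lam`) is an illustrative assumption, not the paper's actual loss.

```python
import numpy as np

def hybrid_trajectory_loss(pred, gt, dt=0.1, a_max=3.0, lam=0.5):
    """Illustrative hybrid loss: continuous regression on predicted waypoints
    plus a kinematic consistency penalty.

    NOTE: this is a sketch under assumed design choices, not the paper's
    implementation. `dt` (timestep), `a_max` (acceleration limit, m/s^2),
    and `lam` (penalty weight) are hypothetical parameters.

    pred, gt: (T, 2) arrays of future (x, y) waypoints.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Continuous regression term: mean squared error over waypoints,
    # standing in for the paper's continuous regression head.
    reg = np.mean((pred - gt) ** 2)

    # Finite-difference velocities and accelerations of the predicted path.
    vel = np.diff(pred, axis=0) / dt   # (T-1, 2)
    acc = np.diff(vel, axis=0) / dt    # (T-2, 2)

    # Kinematic consistency: penalize acceleration magnitudes that exceed
    # a feasibility limit (zero penalty for kinematically plausible paths).
    acc_mag = np.linalg.norm(acc, axis=1)
    kin = np.mean(np.maximum(acc_mag - a_max, 0.0) ** 2)

    return reg + lam * kin
```

A constant-velocity prediction that matches the ground truth incurs zero loss, while a trajectory with abrupt jumps is penalized both by the regression term and, once its implied acceleration exceeds `a_max`, by the kinematic term.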