StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving

Abstract

Vision Language Models (VLMs) bridge visual perception and linguistic reasoning. In Autonomous Driving (AD), this synergy has enabled Vision Language Action (VLA) models, which translate high-level multimodal understanding into driving behaviors, typically represented as future trajectories. However, existing VLA models mainly generate generic collision-free trajectories. Beyond collision avoidance, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized driving. Moreover, many methods treat trajectory generation as naive token prediction, which can produce kinematically infeasible actions. To address these limitations, we present StyleVLA, a physics-informed VLA framework for generating diverse and physically plausible driving behaviors. We introduce a hybrid loss that combines a kinematic consistency constraint with a continuous regression head to improve trajectory feasibility. To train StyleVLA, built on Qwen3-VL-4B, we construct a large-scale instruction dataset with over 1.2k scenarios, 76k Bird's Eye View (BEV) samples, and 42k First Person View (FPV) samples, with ground-truth trajectories for five driving styles and natural-language instructions. Experiments show that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA models. Using a composite driving score measuring success rate, physical feasibility, and style adherence, StyleVLA achieves 0.55 on BEV and 0.51 on FPV, versus 0.32 and 0.35 for Gemini-3-Pro. These results show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.
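
The abstract describes the hybrid loss only at a high level, so the following is a minimal sketch of the general idea rather than the StyleVLA formulation. It assumes a trajectory is an array of states (x, y, heading, speed) sampled at a fixed time step, uses mean squared error as the regression term, and penalizes position steps that disagree with the step implied by speed and heading. The function names, the state layout, the time step `dt`, and the weight `kin_weight` are all hypothetical choices made for illustration.

```python
import numpy as np


def regression_loss(pred, target):
    """Mean squared error over all trajectory states (assumed regression term)."""
    return float(np.mean((pred - target) ** 2))


def kinematic_residual(traj, dt):
    """Mean squared gap between each position step and the step implied by
    the speed and heading at the start of that step.

    traj: array of shape (T, 4) with columns x [m], y [m], heading [rad], speed [m/s].
    """
    x, y, heading, speed = traj.T
    dx_expected = speed[:-1] * np.cos(heading[:-1]) * dt
    dy_expected = speed[:-1] * np.sin(heading[:-1]) * dt
    rx = np.diff(x) - dx_expected
    ry = np.diff(y) - dy_expected
    return float(np.mean(rx ** 2 + ry ** 2))


def hybrid_loss(pred, target, dt=0.1, kin_weight=1.0):
    """Regression term plus a weighted kinematic consistency penalty.

    kin_weight is an assumed hyperparameter, not a value from the paper.
    """
    return regression_loss(pred, target) + kin_weight * kinematic_residual(pred, dt)


# Straight-line motion at 10 m/s: positions advance by exactly speed * dt per step.
dt = 0.1
steps = np.arange(5)
consistent = np.stack(
    [10.0 * dt * steps, np.zeros(5), np.zeros(5), np.full(5, 10.0)], axis=1
)

# Same trajectory with one waypoint shifted 1 m forward, which the stated
# speed cannot explain.
jumpy = consistent.copy()
jumpy[2, 0] += 1.0

print("consistent:", hybrid_loss(consistent, consistent, dt))
print("jumpy:     ", hybrid_loss(jumpy, consistent, dt))
```

In this sketch the shifted waypoint raises both terms: the regression term because it departs from the reference, and the kinematic term because the implied displacement no longer matches speed times `dt`. A trajectory could match the reference closely at every waypoint and still be penalized by the second term if its speeds and headings do not explain its positions, which is the kind of infeasibility the abstract says naive token prediction can produce.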
