DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
January 29, 2026
Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu
cs.AI
Abstract
Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.
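As a rough illustration of the "Continuous Inference" idea summarized above (overlapping reasoning with action execution), the Python sketch below runs model inference and robot control in separate threads so a new action chunk can be computed while the previous one is still being executed. This is not the authors' implementation: VLAPolicy, the chunk length, the control rate, and the stale-chunk handling are all hypothetical placeholders chosen for the sketch.

```python
# Minimal sketch (not the authors' implementation) of overlapping reasoning and
# execution: an inference thread keeps predicting short action chunks from the
# latest observation while an execution thread plays out the most recent chunk
# at a fixed control rate. VLAPolicy, chunk length, and timings are hypothetical.
import queue
import threading
import time


class VLAPolicy:
    """Stand-in for a compact VLA model; predict() returns a short action chunk."""

    def predict(self, observation):
        time.sleep(0.05)  # placeholder for model inference latency
        return [f"action[{observation}][{i}]" for i in range(4)]


def inference_loop(policy, get_observation, action_queue, stop_event):
    """Continuously reason over the newest observation and publish action chunks."""
    while not stop_event.is_set():
        obs = get_observation()            # most recent camera frame
        chunk = policy.predict(obs)        # reasoning overlaps with execution below
        try:
            action_queue.put(chunk, timeout=0.1)
        except queue.Full:
            pass                           # executor still busy; drop this chunk


def execution_loop(action_queue, send_to_robot, stop_event, control_dt=0.02):
    """Execute actions at a fixed rate, switching to a fresher chunk when available."""
    current_chunk = []
    while not stop_event.is_set():
        try:
            current_chunk = action_queue.get_nowait()  # replace stale actions
        except queue.Empty:
            pass
        if current_chunk:
            send_to_robot(current_chunk.pop(0))
        time.sleep(control_dt)


if __name__ == "__main__":
    policy = VLAPolicy()
    actions = queue.Queue(maxsize=1)       # keep only the freshest chunk pending
    stop = threading.Event()
    frame = [0]

    def latest_observation():
        frame[0] += 1
        return f"frame{frame[0]}"

    def send_to_robot(action):
        print("executing", action)

    threads = [
        threading.Thread(target=inference_loop,
                         args=(policy, latest_observation, actions, stop)),
        threading.Thread(target=execution_loop,
                         args=(actions, send_to_robot, stop)),
    ]
    for t in threads:
        t.start()
    time.sleep(0.5)                        # run the overlapped loops briefly
    stop.set()
    for t in threads:
        t.join()
```

The key design point the sketch tries to convey is that inference latency no longer stalls the control loop: the executor keeps issuing actions from the last predicted chunk while the next chunk is computed, and it switches to the fresher chunk as soon as it arrives.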