
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

November 27, 2025
Authors: Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao
cs.AI

Abstract

To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate a mix of annotated robot data and multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning data, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0% in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.
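To make the dual-teacher idea from the abstract concrete, the sketch below routes each training sample to a domain-matched teacher: robot-demonstration samples distill from a frozen action-expert teacher, while multimodal samples distill from a frozen reasoning VLM teacher. This is a minimal PyTorch sketch under assumed interfaces; every name here (`kd_kl`, `dual_teacher_distill_loss`, the `is_robot_domain` flag, `alpha`, `tau`) is a hypothetical illustration, not the paper's actual API or its adaptive weighting scheme.

```python
# Minimal sketch of domain-routed dual-teacher distillation. Assumptions:
# the student emits both action logits and text logits, both teachers are
# frozen, and routing is decided by a per-sample domain flag. All names
# are illustrative, not taken from the DualVLA codebase.
import torch
import torch.nn.functional as F


def kd_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
          tau: float = 2.0) -> torch.Tensor:
    """Soft-label KL distillation with temperature `tau`."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    # kl_div expects log-probs as input and probs as target.
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2


def dual_teacher_distill_loss(student_action_logits: torch.Tensor,
                              student_text_logits: torch.Tensor,
                              action_teacher_logits: torch.Tensor,
                              reasoning_teacher_logits: torch.Tensor,
                              is_robot_domain: torch.Tensor,
                              alpha: float = 0.5) -> torch.Tensor:
    """Assign each sample the supervision matching its data domain:
    robot demonstrations follow the action-expert teacher, multimodal
    samples follow the reasoning teacher; `alpha` balances the streams."""
    robot = is_robot_domain.bool()
    parts = []
    if robot.any():
        parts.append(kd_kl(student_action_logits[robot],
                           action_teacher_logits[robot]))
    if (~robot).any():
        parts.append(alpha * kd_kl(student_text_logits[~robot],
                                   reasoning_teacher_logits[~robot]))
    return torch.stack(parts).sum()


# Toy usage: a batch of 4 samples, 2 from each domain.
B, A, V = 4, 7, 512  # batch size, action-token dim, text vocab (toy sizes)
student_action = torch.randn(B, A, requires_grad=True)
student_text = torch.randn(B, V, requires_grad=True)
loss = dual_teacher_distill_loss(
    student_action_logits=student_action,
    student_text_logits=student_text,
    action_teacher_logits=torch.randn(B, A),
    reasoning_teacher_logits=torch.randn(B, V),
    is_robot_domain=torch.tensor([1, 1, 0, 0]),
)
loss.backward()
```

The sketch fixes the routing and uses a constant `alpha` purely to make the mechanism legible; per the abstract, the actual method adapts the supervision signal to each data domain, and those details are in the paper, not here.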