

DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

November 27, 2025
作者: Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao
cs.AI

Abstract

To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then to mix annotated robot data with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning data, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0% in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.
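The abstract only names the dual-teacher adaptive distillation strategy, so the following is a minimal, hypothetical PyTorch sketch of the general idea: route each training batch to a different frozen teacher depending on its data domain, and combine a temperature-scaled distillation term with the ground-truth loss. The names `action_teacher`, `reasoning_teacher`, the `domain` tag, and all weights are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of dual-teacher, per-domain distillation.
# Assumptions (not from the paper's code): `action_teacher` is a frozen
# specialist VLA fine-tuned on robot demonstrations; `reasoning_teacher`
# is the frozen original multimodal VLM; batch["domain"] tags the batch
# as "robot" (action data) or "multimodal" (reasoning data).
import torch
import torch.nn.functional as F

def dual_teacher_loss(student_logits, batch, action_teacher, reasoning_teacher,
                      tau=2.0, alpha_action=1.0, alpha_reason=1.0):
    """Combine ground-truth cross-entropy with a KL distillation term
    whose teacher and weight depend on the batch's data domain."""
    with torch.no_grad():
        if batch["domain"] == "robot":
            # Robot data: supervise actions with the specialist teacher.
            teacher_logits = action_teacher(batch["inputs"])
            alpha = alpha_action
        else:
            # General multimodal data: supervise with the reasoning teacher.
            teacher_logits = reasoning_teacher(batch["inputs"])
            alpha = alpha_reason

    # Standard temperature-scaled KL distillation (Hinton-style).
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    # Ground-truth term: token-level cross-entropy over (B, T, V) logits.
    ce = F.cross_entropy(
        student_logits.flatten(0, -2), batch["labels"].flatten()
    )
    return ce + alpha * kd
```

In this reading, "adaptive" enters through the per-domain weights (`alpha_action`, `alpha_reason`), which could themselves be scheduled or learned; the sketch keeps them fixed for clarity.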