Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning
October 16, 2025
Authors: Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, Xinping Guan, Xiaokang Yang, Yao Mu
cs.AI
Abstract
Vision-Language-Action (VLA) models are experiencing rapid development and
demonstrating promising capabilities in robotic manipulation tasks. However,
scaling up VLA models presents several critical challenges: (1) Training new
VLA models from scratch demands substantial computational resources and
extensive datasets. Given the current scarcity of robot data, it becomes
particularly valuable to fully leverage well-pretrained VLA model weights
during the scaling process. (2) Real-time control requires carefully balancing
model capacity with computational efficiency. To address these challenges, we
propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits
pretrained weights from dense VLA models and scales up the action expert by
replacing the feedforward layers with sparsely activated MoE layers. AdaMoE
decouples expert selection from expert weighting through an independent scale
adapter that works alongside the traditional router. This enables experts to
be selected based on task relevance while contributing with independently
controlled weights, allowing collaborative expert utilization rather than
winner-takes-all dynamics. Our
approach demonstrates that expertise need not monopolize. Instead, through
collaborative expert utilization, we can achieve superior performance while
maintaining computational efficiency. AdaMoE consistently outperforms the
baseline model across key benchmarks, delivering performance gains of 1.8% on
LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement
in real-world experiments validates its practical effectiveness for robotic
manipulation tasks.
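
The abstract describes the decoupled routing mechanism only at a high level, so the PyTorch sketch below is an illustrative reconstruction rather than the paper's implementation. It shows the core idea: a conventional top-k router decides which experts fire, while a separate scale adapter (assumed here to be a simple linear head) decides how much each selected expert contributes. All names and hyperparameters (DecoupledMoELayer, scale_adapter, num_experts, top_k) are assumptions for illustration; normalizing the scale-adapter logits only over the selected experts is one plausible reading of "independently controlled weights".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledMoELayer(nn.Module):
    """Sketch of an MoE feedforward layer in which expert *selection*
    (router) is decoupled from expert *weighting* (scale adapter)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a standard feedforward block; per the abstract,
        # these would inherit the dense VLA model's pretrained FFN weights.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)         # which experts to select
        self.scale_adapter = nn.Linear(d_model, num_experts)  # how much each contributes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); flatten to tokens for simplicity.
        tokens = x.reshape(-1, x.size(-1))
        # 1) Selection: pick top-k experts by router logits (task relevance).
        router_logits = self.router(tokens)                    # (T, E)
        _, top_idx = router_logits.topk(self.top_k, dim=-1)    # (T, k)
        # 2) Weighting: mixing coefficients come from the scale adapter,
        #    normalized over the selected experts only, not from the router.
        scale_logits = self.scale_adapter(tokens)              # (T, E)
        top_scales = scale_logits.gather(-1, top_idx)          # (T, k)
        weights = F.softmax(top_scales, dim=-1)                # (T, k)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]                             # expert id per token
            w = weights[:, slot].unsqueeze(-1)                 # its mixing weight
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(tokens[mask])
        return out.reshape_as(x)
```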