

Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

October 16, 2025
Authors: Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, Xinping Guan, Xiaokang Yang, Yao Mu
cs.AI

Abstract

Vision-Language-Action (VLA) models are developing rapidly and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by replacing the feedforward layers with sparsely activated MoE layers. AdaMoE decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize: through collaborative expert utilization, we achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
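The decoupling idea in the abstract can be illustrated with a toy sketch: a router produces logits used only to *select* the top-k experts, while a separate scale adapter produces the *weights* with which the selected experts contribute. This is a hypothetical NumPy illustration under assumed shapes and names (the paper does not publish this code); all identifiers here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class DecoupledMoE:
    """Toy MoE feedforward layer where expert SELECTION (router) is
    decoupled from expert WEIGHTING (scale adapter). Hypothetical
    sketch, not the authors' implementation."""

    def __init__(self, d_model, d_ff, num_experts=4, top_k=2):
        self.top_k = top_k
        # Each expert is a small two-layer feedforward block (W1, W2).
        self.experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
                         rng.standard_normal((d_ff, d_model)) * 0.02)
                        for _ in range(num_experts)]
        # Router: scores that decide WHICH experts are active.
        self.router = rng.standard_normal((d_model, num_experts)) * 0.02
        # Scale adapter: separate scores that decide HOW MUCH each
        # selected expert contributes.
        self.adapter = rng.standard_normal((d_model, num_experts)) * 0.02

    def __call__(self, x):                       # x: (d_model,) single token
        sel_logits = x @ self.router             # selection by task relevance
        top = np.argsort(sel_logits)[-self.top_k:]
        # Contribution weights come from the independent adapter,
        # renormalized over the selected set -- selection and weighting
        # are controlled by different parameters.
        w = softmax((x @ self.adapter)[top])
        out = np.zeros_like(x)
        for wi, e in zip(w, top):
            W1, W2 = self.experts[e]
            h = np.maximum(x @ W1, 0.0)          # ReLU feedforward expert
            out += wi * (h @ W2)
        return out

layer = DecoupledMoE(d_model=8, d_ff=16)
y = layer(rng.standard_normal(8))
print(y.shape)  # (8,)
```

Because the adapter's weights are renormalized only over the selected experts, a weakly selected expert can still contribute substantially, giving the collaborative rather than winner-takes-all behavior the abstract describes.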