TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
January 20, 2026
Authors: Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, Kai Chen
cs.AI
Abstract
Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM retaining universal semantic understanding and a specialist VLM dedicated to embodied proprioception for joint robotic control. TwinBrainVLA synergizes a frozen "Left Brain", which retains robust general visual reasoning, with a trainable "Right Brain", specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. Extensive experiments on SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baselines while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM, offering a promising direction for building general-purpose robots that simultaneously achieve high-level semantic understanding and low-level physical dexterity.
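The abstract does not give implementation details for the Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. The following is a minimal, hypothetical sketch of the core idea as described above: a trainable "Right Brain" attends jointly over its own embodied tokens and the hidden states of a frozen "Left Brain" VLM, producing fused features that could condition a downstream action expert. All names (e.g., `AsymmetricJointAttention`, `d_model`) and the PyTorch-style formulation are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of one asymmetric joint-attention step (not the paper's code).
import torch
import torch.nn as nn


class AsymmetricJointAttention(nn.Module):
    """Trainable Right-Brain queries read from both their own tokens and the
    frozen Left-Brain tokens; the Left Brain provides fixed semantic context."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Only the Right-Brain side (this attention layer) is trainable.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, right_tokens: torch.Tensor, left_tokens: torch.Tensor) -> torch.Tensor:
        # left_tokens come from the frozen generalist VLM; detach them so no
        # gradient flows back into the Left Brain.
        context = torch.cat([left_tokens.detach(), right_tokens], dim=1)
        fused, _ = self.attn(query=right_tokens, key=context, value=context)
        return fused


if __name__ == "__main__":
    layer = AsymmetricJointAttention()
    left = torch.randn(2, 64, 512)   # frozen Left-Brain hidden states (semantic context)
    right = torch.randn(2, 32, 512)  # trainable Right-Brain tokens (incl. proprioception)
    out = layer(right, left)         # fused features to condition the action expert
    print(out.shape)                 # torch.Size([2, 32, 512])
```

In this reading, freezing the Left Brain and blocking gradients into it is what preserves the pre-trained VLM's open-world understanding, while the Right Brain and action expert absorb all task-specific adaptation.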