TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
January 20, 2026
Authors: Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, Kai Chen
cs.AI
Abstract
Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone directly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM, which retains universal semantic understanding, with a specialist VLM dedicated to embodied proprioception for joint robotic control. TwinBrainVLA synergizes a frozen "Left Brain", which preserves robust general visual reasoning, with a trainable "Right Brain", specialized for embodied perception, via an Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. Extensive experiments on the SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baselines while preserving the comprehensive visual understanding of the pre-trained VLM. These results point toward general-purpose robots that combine high-level semantic understanding with low-level physical dexterity.
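The asymmetric, one-directional knowledge flow described in the abstract can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendering of one AsyMoT-style layer: the trainable Right Brain stream self-attends over its own embodied tokens and then cross-attends to hidden states produced by the frozen Left Brain, so semantic knowledge flows from Left to Right but never the reverse. All class names, dimensions, and the layer layout here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class AsyMoTLayer(nn.Module):
    """Minimal sketch of one asymmetric mixture-of-transformers layer.

    The trainable "Right Brain" stream self-attends over its own embodied
    tokens and then cross-attends to hidden states from the frozen
    "Left Brain", so semantic knowledge flows one way (Left -> Right).
    Module names, dimensions, and layout are illustrative assumptions.
    """

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, right_tokens: torch.Tensor, left_hidden: torch.Tensor) -> torch.Tensor:
        # Self-attention over the Right Brain's embodied tokens
        # (camera observations plus proprioceptive state).
        q = self.norm1(right_tokens)
        h = right_tokens + self.self_attn(q, q, q, need_weights=False)[0]
        # Asymmetric step: query the frozen Left Brain's semantic hidden
        # states; gradients never flow back into the Left Brain.
        h = h + self.cross_attn(self.norm2(h), left_hidden, left_hidden,
                                need_weights=False)[0]
        return h + self.mlp(self.norm3(h))


if __name__ == "__main__":
    batch, n_semantic, n_embodied, dim = 2, 64, 32, 512
    layer = AsyMoTLayer(dim)
    with torch.no_grad():                        # the Left Brain stays frozen
        left_hidden = torch.randn(batch, n_semantic, dim)
    right_tokens = torch.randn(batch, n_embodied, dim)
    fused = layer(right_tokens, left_hidden)     # conditioning for the action expert
    print(fused.shape)                           # torch.Size([2, 32, 512])
```

In the full model, the fused Right Brain features would then condition a flow-matching action expert that integrates a learned velocity field into continuous control commands; that component is omitted from this sketch.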