TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Abstract

Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, and it often leads to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM, which retains universal semantic understanding, with a specialist VLM dedicated to embodied proprioception, for joint robotic control. TwinBrainVLA couples a frozen "Left Brain", which retains robust general visual reasoning, with a trainable "Right Brain", specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert that generates precise continuous controls. Extensive experiments on the SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baselines while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM. This offers a promising direction for building general-purpose robots that achieve high-level semantic understanding and low-level physical dexterity at the same time.
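
The asymmetric coupling described above can be illustrated with a small sketch. The abstract does not give the exact AsyMoT formulation, so this is only one plausible reading under stated assumptions: the frozen stream attends to its own tokens only, while the trainable stream's queries attend over the keys and values of both streams. The function names (`attention`, `asymmetric_layer`), the single-head layout, and the omission of masking, residual connections, and MLP blocks are all simplifications introduced here for illustration, not details from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain single-head scaled dot-product attention (no masking).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def asymmetric_layer(h_left, h_right, w_left, w_right):
    """One hypothetical asymmetric attention step.

    h_left:  (n_left, d) hidden states of the frozen generalist stream.
    h_right: (n_right, d) hidden states of the trainable specialist stream.
    w_left, w_right: dicts with "q", "k", "v" projection matrices of shape (d, d).
    """
    q_l, k_l, v_l = (h_left @ w_left[name] for name in "qkv")
    q_r, k_r, v_r = (h_right @ w_right[name] for name in "qkv")

    # Assumption: the frozen stream never sees the trainable stream,
    # so its output is identical to running it on its own.
    out_left = attention(q_l, k_l, v_l)

    # Assumption: the trainable stream queries both its own tokens and
    # the frozen stream's tokens through a joint key/value set.
    k_joint = np.concatenate([k_l, k_r], axis=0)
    v_joint = np.concatenate([v_l, v_r], axis=0)
    out_right = attention(q_r, k_joint, v_joint)

    return out_left, out_right
```

In a trainable implementation under the same assumptions, the left-stream weights and its keys and values would additionally be excluded from gradient updates, so that only the right stream adapts to the robot data while still reading from the frozen stream.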

