Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Abstract

The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted growing interest in extending them to physical entities such as legged robots. This typically requires MLLMs not only to possess multimodal understanding abilities, but also to integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities because of their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. A novel robotic adapter is then proposed to convert textual control signals from MLLMs into motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing the various capabilities of VeBrain. For VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and we adopt multimodal chain-of-thought (CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate that VeBrain outperforms existing MLLMs such as Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves a substantial gain of +5.6% on MMVet, but also excels in legged robot tasks with an average gain of +50%.
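
The abstract does not specify the format of the textual control signals or the internals of the robotic adapter. The following is only a minimal sketch of the general idea of turning a text output in 2D image space into a 3D target, under stated assumptions: the text format (`skill: ...; point: (u, v)`), the names `Intrinsics`, `parse_control_text`, `backproject`, and `text_to_target`, and the use of a depth reading with a pinhole camera model are all hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics, in pixels (assumed to be known from calibration)."""
    fx: float
    fy: float
    cx: float
    cy: float


# Hypothetical text format for the model output, e.g. "skill: grasp; point: (420, 240)".
# The real VeBrain output format is not given in the abstract.
_PATTERN = re.compile(
    r"skill:\s*(\w+)\s*;\s*point:\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)"
)


def parse_control_text(text):
    """Extract a skill name and a 2D pixel keypoint from the model's text output."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized control text: {text!r}")
    return match.group(1), (int(match.group(2)), int(match.group(3)))


def backproject(pixel, depth_m, k):
    """Lift a pixel (u, v) with depth in metres to a 3D point in the camera frame."""
    u, v = pixel
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def text_to_target(text, depth_m, k):
    """Illustrative adapter step: text in 2D image space -> (skill, 3D target)."""
    skill, pixel = parse_control_text(text)
    return skill, backproject(pixel, depth_m, k)
```

In this sketch, a downstream low-level controller (not shown, and also an assumption) would consume the skill name and the 3D target. The point of the design is that the language model only ever emits ordinary text about image coordinates, while robot-specific geometry stays outside the model.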
