Embodied-R1.5: Evolving Physical Intelligence with Embodied Foundation Models

Abstract

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates embodied cognition, task planning, correction, and pointing within a single architecture, as a step toward general physical intelligence. Using three automated data construction pipelines to expand data coverage of these critical capabilities, we build a large-scale data system of over 15B tokens and design a multi-task balanced reinforcement learning (RL) recipe to alleviate conflicts among heterogeneous tasks. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute long-horizon tasks and correct its own errors along the way. With only 8B parameters, Embodied-R1.5 achieves state-of-the-art (SOTA) results on 16 of 24 embodied vision-language model (VLM) benchmarks, surpassing leading models such as Gemini-Robotics-ER-1.5 and GPT-5.4. Because these embodied capabilities are internalized, Embodied-R1.5 can be fine-tuned into a vision-language-action (VLA) model with only a small amount of data, outperforming leading VLA models such as π_{0.5} across four popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments covering instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, and the results demonstrate strong generalization to the physical world. We open-source the model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored to embodied tasks, to facilitate future research on EFMs.