EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Abstract

Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.
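As a rough illustration of the trajectory structure and the per-category success rate described in the abstract, the following Python sketch may help. It is not the dataset's actual schema or evaluation code: the class names `Step` and `Episode`, every field name, and the helper `success_rate_by_category` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Step:
    # All field names are assumptions made for illustration only.
    observation_path: str  # path to the first-person frame at this step
    action: str            # grounded action label, e.g. "move_forward"
    rationale: str         # natural-language statement of the agent's intent


@dataclass
class Episode:
    instruction: str       # high-level language instruction for the task
    category: str          # one of the three benchmark dimensions
    steps: list[Step] = field(default_factory=list)
    success: bool = False  # whether the rollout reached the goal


def success_rate_by_category(episodes):
    """Return {category: fraction of episodes marked successful}."""
    totals, wins = defaultdict(int), defaultdict(int)
    for ep in episodes:
        totals[ep.category] += 1
        wins[ep.category] += int(ep.success)
    return {c: wins[c] / totals[c] for c in totals}
```

Grouping by category keeps the three dimensions (Exploration, Dynamic Spatial-Semantic Reasoning, Multi-stage Goal Execution) separately visible, which mirrors how the abstract reports results.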
