

EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

July 14, 2025
作者: Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi
cs.AI

Abstract

Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.
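To make the trajectory format concrete, below is a minimal sketch of how one EmbRACE-3K-style task record could be represented in Python. The schema is an assumption inferred from the abstract, which states only that each step pairs a first-person observation with a high-level instruction, a grounded action, and a natural-language rationale; field names, action strings, and the example task are illustrative, not the dataset's actual format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema for an EmbRACE-3K-style trajectory, inferred from the
# abstract. Field names and values are illustrative, not the released format.

@dataclass
class TrajectoryStep:
    observation_path: str  # first-person RGB frame rendered by Unreal Engine
    instruction: str       # high-level language instruction for the task
    action: str            # grounded action taken at this step
    rationale: str         # agent's stated intent for choosing this action

@dataclass
class Task:
    task_id: str
    category: str          # e.g. exploration, spatial-semantic reasoning,
                           # or multi-stage goal execution
    steps: List[TrajectoryStep]

# Hypothetical two-step task record illustrating the pairing described
# in the abstract.
task = Task(
    task_id="demo-0001",
    category="exploration",
    steps=[
        TrajectoryStep(
            observation_path="frames/demo-0001/000.png",
            instruction="Find the red key and bring it to the locked door.",
            action="TurnLeft",
            rationale="The hallway to the left is unexplored and may contain the key.",
        ),
        TrajectoryStep(
            observation_path="frames/demo-0001/001.png",
            instruction="Find the red key and bring it to the locked door.",
            action="MoveForward",
            rationale="A red object is visible on the table ahead; approach to inspect it.",
        ),
    ],
)
```

A record like this makes the benchmark's supervision signal explicit: the rationale field is what distinguishes EmbRACE-3K's step-level reasoning annotations from plain observation-action trajectories.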