EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
July 14, 2025
Authors: Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi
cs.AI
Abstract
Recent advanced vision-language models (VLMs) have demonstrated strong
performance on passive, offline image and video understanding tasks. However,
their effectiveness in embodied settings, which require online interaction and
active scene understanding, remains limited. In such scenarios, an agent
perceives the environment from a first-person perspective, with each action
dynamically shaping subsequent observations. Even state-of-the-art models such
as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment
interactions, exhibiting clear limitations in spatial reasoning and
long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset
of over 3,000 language-guided tasks situated in diverse, photorealistic
environments constructed using Unreal Engine and the UnrealCV-Zoo framework.
The tasks encompass a wide range of embodied challenges, including navigation,
object manipulation, and multi-stage goal execution. Each task unfolds as a
multi-step trajectory, pairing first-person visual observations with high-level
instructions, grounded actions, and natural language rationales that express
the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to
evaluate the embodied reasoning capabilities of VLMs across three key
dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage
Goal Execution. In zero-shot settings, all models achieve success rates below
20%, underscoring the challenge posed by our benchmark and the current
limitations of VLMs in interactive environments. To demonstrate the utility of
EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning
followed by reinforcement learning. This approach yields substantial
improvements across all three challenge categories, highlighting the dataset's
effectiveness in enabling the development of embodied reasoning capabilities.
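To make the described data format concrete, below is a minimal, hypothetical Python sketch of how one step of an EmbRACE-3K trajectory might be represented: a first-person observation paired with the high-level instruction, the agent's natural-language rationale, and the grounded action. The class and field names (TrajectoryStep, observation_path, rationale, etc.) are illustrative assumptions, not the dataset's released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrajectoryStep:
    # Hypothetical fields, assumed for illustration only.
    observation_path: str   # first-person RGB frame rendered in Unreal Engine
    instruction: str        # high-level language instruction for the task
    rationale: str          # natural-language reasoning expressing the agent's intent
    action: str             # grounded action executed at this step

@dataclass
class Task:
    task_id: str
    category: str           # e.g. exploration, spatial-semantic reasoning, multi-stage goal
    steps: List[TrajectoryStep]

def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks completed, as reported per challenge category."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```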