EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Abstract

Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.
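
The abstract describes each task as a multi-step trajectory that pairs an instruction with per-step observations, grounded actions, and rationales, and it reports success rates per challenge category. The following is a minimal sketch of how such a record and a per-category success rate could be represented. All class names, field names, action strings, and category labels here are hypothetical illustrations and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    # Hypothetical fields: the abstract only says each step pairs a
    # first-person observation with a grounded action and a rationale.
    observation_path: str  # path to the first-person frame for this step
    action: str            # grounded action label (names are assumed)
    rationale: str         # natural-language statement of the agent's intent


@dataclass
class Trajectory:
    instruction: str       # high-level language instruction for the task
    category: str          # assumed label for the challenge category
    steps: List[Step] = field(default_factory=list)
    success: bool = False  # whether the episode reached its goal


def success_rate(trajectories: List[Trajectory], category: str) -> float:
    """Fraction of successful trajectories within one category."""
    selected = [t for t in trajectories if t.category == category]
    if not selected:
        return 0.0
    return sum(t.success for t in selected) / len(selected)


# Toy data for illustration only; these are not real dataset entries.
demo = [
    Trajectory(
        instruction="Find the red chair.",
        category="exploration",
        steps=[Step("frame_000.png", "turn_left",
                    "The chair is not visible, so I look to the left.")],
        success=True,
    ),
    Trajectory(instruction="Walk to the door.", category="exploration"),
    Trajectory(instruction="Open the gate, then enter.", category="multi_stage"),
]
```

With records in this shape, a claim such as "success rates below 20%" corresponds to `success_rate(...)` returning a value under 0.2 for each category.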
