Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Abstract

Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning, which relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. In our evaluation, the model significantly outperforms advanced visual reasoning models: it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages on complex long-horizon tasks. Experiments in real-world environments show the same advantage, again with fewer repeated searches and logical inconsistencies.
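As a rough illustration of the data and the second training stage described above, the sketch below models one Observation-Thought-Action trajectory and a rejection-sampling filter that keeps only successful rollouts. The five thought types come from the abstract; everything else (the class and field names, string-valued observations and actions, and the boolean `success` flag used as the acceptance criterion) is a hypothetical assumption made for illustration, not the authors' data format or implementation.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ThoughtType(Enum):
    # The five thinking processes named in the abstract.
    ANALYSIS = "analysis"
    SPATIAL_REASONING = "spatial_reasoning"
    REFLECTION = "reflection"
    PLANNING = "planning"
    VERIFICATION = "verification"


@dataclass
class Thought:
    kind: ThoughtType
    text: str


@dataclass
class Step:
    observation: str         # assumption: an image path or ID
    thoughts: List[Thought]  # reasoning produced before acting
    action: str              # assumption: a free-form action string


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    success: bool = False    # assumption: task-completion flag


def rejection_sample(candidates: List[Trajectory]) -> List[Trajectory]:
    """Keep only sampled trajectories that completed their task."""
    return [t for t in candidates if t.success]


def count_thought_types(trajectories: List[Trajectory]) -> Counter:
    """Tally how often each thought type appears across trajectories."""
    return Counter(
        thought.kind
        for traj in trajectories
        for step in traj.steps
        for thought in step.thoughts
    )
```

In a setup like this, the trajectories that pass the filter would be added back to the training set for the next round, while failed ones could serve as raw material for reflection-style corrections; how the paper actually handles either case is not specified in the abstract.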
