Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Abstract

Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning, which relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. Evaluation shows that our model significantly outperforms advanced visual reasoning models: it exceeds OpenAI o1, o3-mini, and Claude-3.7 by 9%, 24%, and 13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. In real-world environments, our model also performs better, again with fewer repeated searches and logical inconsistencies.
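
To make the Observation-Thought-Action structure and the rejection-sampling stage more concrete, the following is a minimal sketch. It is not the authors' implementation: the class names, field names, action strings, and the `rejection_sample` helper are all hypothetical, and the abstract does not specify the actual data format or sampling procedure. Only the five thought categories are taken from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# The five thought categories named in the abstract.
THOUGHT_TYPES = {"analysis", "spatial reasoning", "reflection", "planning", "verification"}


@dataclass
class Step:
    """One Observation-Thought-Action step (all field names are hypothetical)."""
    image_path: str            # observation: the image seen at this step
    thoughts: List[str]        # free-form thinking text
    thought_types: List[str]   # one label from THOUGHT_TYPES per thought
    action: str                # illustrative action string, e.g. "navigate to cabinet"


@dataclass
class Trajectory:
    """An interleaved image-action trajectory for a single task."""
    task: str
    steps: List[Step] = field(default_factory=list)
    success: bool = False      # whether the task goal was reached


def rejection_sample(
    sample_fn: Callable[[str], Trajectory], task: str, num_samples: int
) -> List[Trajectory]:
    """Draw num_samples trajectories for a task and keep only the successful ones.

    Assumption: success is the only acceptance criterion; the paper may
    apply additional filters.
    """
    candidates = [sample_fn(task) for _ in range(num_samples)]
    return [t for t in candidates if t.success]
```

In a sketch like this, the retained trajectories would serve as further training data for the next stage, while `sample_fn` stands in for whatever policy rollout the model uses in its environment.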
