Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Abstract

Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through interleaved image-action trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning, which relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. Evaluation shows that our model significantly outperforms advanced visual reasoning models; for example, it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Tests in real-world environments also confirm our model's superiority, again with fewer repeated searches and fewer cases of logical inconsistency.
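To make the Observation-Thought-Action format and the rejection-sampling stage more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the names `Step`, `Trajectory`, `rejection_sample`, and `make_toy_rollout` are hypothetical, images are represented by placeholder file names, and the toy rollout simply alternates between failure and success. The sketch assumes only the general idea that several trajectories are sampled per task and that only the successful ones are kept for further training.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One Observation-Thought-Action unit (all field names are illustrative)."""
    observation: str  # placeholder for an image, e.g. a file name
    thought: str      # e.g. analysis, spatial reasoning, reflection, planning, verification
    action: str       # the action issued to the environment


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    success: bool = False


def rejection_sample(
    tasks: List[str],
    rollout: Callable[[str], Trajectory],
    num_samples: int = 4,
) -> List[Trajectory]:
    """Sample several trajectories per task and keep only the successful ones."""
    kept: List[Trajectory] = []
    for task in tasks:
        for _ in range(num_samples):
            trajectory = rollout(task)
            if trajectory.success:  # reject failed attempts
                kept.append(trajectory)
    return kept


def make_toy_rollout() -> Callable[[str], Trajectory]:
    """Hypothetical stand-in for a model rollout: every second call succeeds."""
    calls = {"n": 0}

    def rollout(task: str) -> Trajectory:
        calls["n"] += 1
        step = Step(
            observation="frame_0.png",
            thought=f"Plan where to search for: {task}",
            action="navigate to cabinet",
        )
        return Trajectory(task=task, steps=[step], success=calls["n"] % 2 == 0)

    return rollout


kept = rejection_sample(["find the mug"], make_toy_rollout(), num_samples=4)
print(len(kept))  # 2 of the 4 toy rollouts succeed
```

In a real system the rollout would query the model inside a simulator and the success flag would come from a task checker; both are outside the scope of this sketch.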
