Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
February 24, 2026
Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
cs.AI
Abstract
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials in which mistakes repeat rather than accumulate into experience. Inspired by human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate multiple candidate actions and score them with internal reflections before execution; and reflection-on-action, where the agent uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also introduce retrospective reflection, which allows the agent to re-evaluate earlier decisions and update its models with hindsight, enabling proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and a MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, and ablation studies validate the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight the behavioral corrections enabled by reflection.
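To make the loop described above concrete, here is a minimal, hypothetical sketch of how the two reflection modes and retrospective reflection could be wired together. The class and function names (propose_actions, internal_reflection_score, update_models) are illustrative placeholders supplied as callables, not the authors' actual API; the abstract does not specify implementation details.

```python
# Hypothetical sketch of reflective test-time planning; all component
# callables are assumptions, not the paper's actual interfaces.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReflectivePlanner:
    # Policy: proposes k candidate actions for the current observation
    # (test-time scaling).
    propose_actions: Callable[[str, int], List[str]]
    # Internal reflection model: scores a candidate action before execution.
    internal_reflection_score: Callable[[str, str], float]
    # Test-time training step: updates the internal reflection model and the
    # policy from (observation, action, external reflection) tuples.
    update_models: Callable[[List[Tuple[str, str, str]]], None]
    num_candidates: int = 5
    trajectory: List[Tuple[str, str, str]] = field(default_factory=list)

    def act(self, observation: str) -> str:
        """Reflection-in-action: generate candidates, score each with the
        internal reflection model, and execute the best one."""
        candidates = self.propose_actions(observation, self.num_candidates)
        return max(candidates,
                   key=lambda a: self.internal_reflection_score(observation, a))

    def reflect_on_action(self, observation: str, action: str,
                          external_reflection: str) -> None:
        """Reflection-on-action: record the outcome and run a test-time
        training update on the most recent step."""
        self.trajectory.append((observation, action, external_reflection))
        self.update_models([self.trajectory[-1]])

    def retrospective_reflection(self) -> None:
        """Re-evaluate earlier decisions with hindsight by updating on the
        full trajectory, so credit for late failures can propagate back to
        early steps (long-horizon credit assignment)."""
        self.update_models(self.trajectory)
```

In this reading, reflection-in-action happens before each action (scoring candidates without touching the environment), reflection-on-action happens after each action (a small update from external feedback), and retrospective reflection is invoked at the end of an episode to revisit the whole trajectory.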