
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

February 24, 2026
Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
cs.AI

Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
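The two reflection modes can be sketched as a small planning loop: an internal reflection model scores sampled candidate actions before execution (reflection-in-action), and external feedback after execution updates that internal model (reflection-on-action). This is a minimal illustrative sketch; the class, method names, and scoring mechanics are assumptions, not the paper's actual implementation.

```python
import random


class ReflectivePlanner:
    """Toy sketch of the two reflection modes described in the abstract.

    All names and update rules here are illustrative assumptions,
    not the method from the paper.
    """

    def __init__(self, actions):
        self.actions = list(actions)
        # Internal reflection model: a learned score per candidate action.
        self.scores = {a: 0.0 for a in self.actions}

    def reflect_in_action(self, k=3):
        """Test-time scaling: sample k candidate actions and pick the one
        the internal reflection model scores highest, before executing."""
        candidates = random.sample(self.actions, k)
        return max(candidates, key=lambda a: self.scores[a])

    def reflect_on_action(self, action, external_feedback, lr=0.5):
        """Test-time training: after execution, move the internal score
        toward the external feedback signal for the executed action."""
        self.scores[action] += lr * (external_feedback - self.scores[action])
```

For example, after executing `"open"` and receiving negative external feedback, the internal score for `"open"` drops, so subsequent reflection-in-action steps favor other candidates; a retrospective variant could apply the same update to earlier actions in the episode with discounted credit.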
PDF (42) · March 28, 2026