
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

February 24, 2026
Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
cs.AI

Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
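The two reflection modes can be sketched as a small planning loop: an internal reflection model scores sampled candidate actions before execution (reflection-in-action), and external feedback after execution updates that internal model (reflection-on-action). This is a minimal illustrative sketch; the class, method names, and scoring mechanics are assumptions, not the paper's actual implementation.

```python
import random


class ReflectivePlanner:
    """Toy sketch of the two reflection modes described in the abstract.

    All names and update rules here are illustrative assumptions,
    not the method from the paper.
    """

    def __init__(self, actions):
        self.actions = list(actions)
        # Internal reflection model: a learned score per candidate action.
        self.scores = {a: 0.0 for a in self.actions}

    def reflect_in_action(self, k=3):
        """Test-time scaling: sample k candidate actions and pick the one
        the internal reflection model scores highest, before executing."""
        candidates = random.sample(self.actions, k)
        return max(candidates, key=lambda a: self.scores[a])

    def reflect_on_action(self, action, external_feedback, lr=0.5):
        """Test-time training: after execution, move the internal score
        toward the external feedback signal for the executed action."""
        self.scores[action] += lr * (external_feedback - self.scores[action])
```

For example, after executing `"open"` and receiving negative external feedback, the internal score for `"open"` drops, so subsequent reflection-in-action steps favor other candidates; a retrospective variant could apply the same update to earlier actions in the episode with discounted credit.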
PDF (42) · March 28, 2026