Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
February 24, 2026
Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
cs.AI
Abstract
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials in which mistakes repeat rather than accumulate into experience. Inspired by human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate multiple candidate actions and score them with internal reflections before execution; and reflection-on-action, where the agent uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also introduce retrospective reflection, which allows the agent to re-evaluate earlier decisions and update its models with hindsight, enabling proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and a MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, and ablation studies validate the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight the behavioral corrections enabled by reflection.
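To make the loop described above concrete, here is a minimal, hypothetical sketch of how the two reflection modes and retrospective reflection could be wired together. The class and function names (propose_actions, internal_reflection_score, update_models) are illustrative placeholders supplied as callables, not the authors' actual API; the abstract does not specify implementation details.

```python
# Hypothetical sketch of reflective test-time planning; all component
# callables are assumptions, not the paper's actual interfaces.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReflectivePlanner:
    # Policy: proposes k candidate actions for the current observation
    # (test-time scaling).
    propose_actions: Callable[[str, int], List[str]]
    # Internal reflection model: scores a candidate action before execution.
    internal_reflection_score: Callable[[str, str], float]
    # Test-time training step: updates the internal reflection model and the
    # policy from (observation, action, external reflection) tuples.
    update_models: Callable[[List[Tuple[str, str, str]]], None]
    num_candidates: int = 5
    trajectory: List[Tuple[str, str, str]] = field(default_factory=list)

    def act(self, observation: str) -> str:
        """Reflection-in-action: generate candidates, score each with the
        internal reflection model, and execute the best one."""
        candidates = self.propose_actions(observation, self.num_candidates)
        return max(candidates,
                   key=lambda a: self.internal_reflection_score(observation, a))

    def reflect_on_action(self, observation: str, action: str,
                          external_reflection: str) -> None:
        """Reflection-on-action: record the outcome and run a test-time
        training update on the most recent step."""
        self.trajectory.append((observation, action, external_reflection))
        self.update_models([self.trajectory[-1]])

    def retrospective_reflection(self) -> None:
        """Re-evaluate earlier decisions with hindsight by updating on the
        full trajectory, so credit for late failures can propagate back to
        early steps (long-horizon credit assignment)."""
        self.update_models(self.trajectory)
```

In this reading, reflection-in-action happens before each action (scoring candidates without touching the environment), reflection-on-action happens after each action (a small update from external feedback), and retrospective reflection is invoked at the end of an episode to revisit the whole trajectory.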