Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Abstract

Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, so deployment becomes a sequence of independent trials in which mistakes repeat rather than accumulate into experience. Drawing on the notion of the human reflective practitioner, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, in which the agent uses test-time scaling to generate multiple candidate actions and score them with internal reflections before execution; and reflection-on-action, in which the agent uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, which lets the agent re-evaluate earlier decisions and update its models with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, and ablation studies validate the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
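
The three reflection steps described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it replaces the LLM reflection model and action policy with small lookup tables, and every name in it (`ReflectiveAgent`, `propose`, `lr`, `run_trials`, the +1/-1 feedback values) is a hypothetical stand-in chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReflectiveAgent:
    """Toy tabular stand-in for an embodied LLM planner (all names hypothetical)."""

    actions: list[str]
    lr: float = 0.5  # assumed step size for the toy update rule
    # Internal reflection model: predicted usefulness of each action.
    reflection: dict[str, float] = field(default_factory=dict)
    # Action policy: preference used to order candidate proposals.
    policy: dict[str, float] = field(default_factory=dict)

    def propose(self, n: int) -> list[str]:
        # Test-time scaling: return the n candidates the policy currently prefers.
        ranked = sorted(self.actions, key=lambda a: -self.policy.get(a, 0.0))
        return ranked[:n]

    def reflect_in_action(self, n: int = 3) -> str:
        # Before execution: score candidates with the internal reflection model.
        candidates = self.propose(n)
        return max(candidates, key=lambda a: self.reflection.get(a, 0.0))

    def reflect_on_action(self, action: str, feedback: float) -> None:
        # After execution: test-time training moves both tables toward
        # the external feedback signal.
        for table in (self.reflection, self.policy):
            old = table.get(action, 0.0)
            table[action] = old + self.lr * (feedback - old)

    def reflect_retrospectively(self, history: list[str], outcome: float) -> None:
        # Hindsight: re-label every earlier decision with the final outcome.
        for action in history:
            self.reflect_on_action(action, outcome)


def run_trials(agent: ReflectiveAgent, correct: str, trials: int) -> list[str]:
    """Run repeated trials in a one-step toy task where only `correct` succeeds."""
    history = []
    for _ in range(trials):
        action = agent.reflect_in_action()
        feedback = 1.0 if action == correct else -1.0  # assumed external reflection
        agent.reflect_on_action(action, feedback)
        history.append(action)
    return history
```

With actions `["push", "pull", "lift"]` and `"lift"` as the only successful action, `run_trials` tries `push`, then `pull`, then settles on `lift`, because each failed action is down-weighted instead of being retried. In a real system the tables would be model weights and the feedback would come from an external reflection on the executed action.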
