Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Abstract

Embodied large language models (LLMs) endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why. Deployment therefore becomes a sequence of independent trials in which mistakes repeat rather than accumulate into experience. Drawing on the notion of the human reflective practitioner, we introduce Reflective Test-Time Planning, which integrates two modes of reflection. The first, reflection-in-action, uses test-time scaling: before execution, the agent generates multiple candidate actions and scores them with internal reflections. The second, reflection-on-action, uses test-time training: after execution, the agent updates both its internal reflection model and its action policy based on external reflections. We also include retrospective reflection, which lets the agent re-evaluate earlier decisions and update its models with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, and ablation studies validate the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
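To make the loop described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a table of per-action scores standing in for the internal reflection model, a scalar environment signal standing in for the external reflection, and a simple moving-average update standing in for test-time training. All names (`ToyReflectiveAgent`, `propose`, `env_feedback`, `run_episode`, and the `lr` and `discount` parameters) are hypothetical.

```python
import random
from collections import defaultdict


class ToyReflectiveAgent:
    """Toy stand-in: a score table plays the role of the internal reflection model."""

    def __init__(self, actions, num_candidates=2, lr=0.5, seed=0):
        self.actions = list(actions)
        self.num_candidates = num_candidates
        self.lr = lr                      # step size of the toy update (assumption)
        self.rng = random.Random(seed)
        self.score = defaultdict(float)   # internal score estimate per action
        self.history = []                 # (action, feedback) pairs of the current episode

    def propose(self):
        # Test-time scaling stand-in: sample several candidate actions.
        k = min(self.num_candidates, len(self.actions))
        return self.rng.sample(self.actions, k)

    def reflect_in_action(self, candidates):
        # Before execution: score candidates internally and pick the best one.
        return max(candidates, key=lambda a: self.score[a])

    def reflect_on_action(self, action, feedback):
        # After execution: move the internal score toward the external feedback.
        self.score[action] += self.lr * (feedback - self.score[action])
        self.history.append((action, feedback))

    def retrospective_reflection(self, final_feedback, discount=0.5):
        # Hindsight update: revisit earlier decisions with a decaying weight.
        weight = discount
        for action, _ in reversed(self.history[:-1]):
            self.score[action] += self.lr * weight * (final_feedback - self.score[action])
            weight *= discount


def run_episode(agent, env_feedback, steps):
    """Run one episode: propose, select, execute, then update from feedback."""
    agent.history.clear()
    for _ in range(steps):
        action = agent.reflect_in_action(agent.propose())
        agent.reflect_on_action(action, env_feedback(action))
```

A usage example under the same assumptions: with two hypothetical actions where `env_feedback` returns 1.0 for `"open_drawer"` and -1.0 for `"push_drawer"`, a few steps of `run_episode` are enough for the agent to prefer `"open_drawer"`, because a failed attempt lowers that action's internal score instead of being forgotten. In the paper's setting the score table would be replaced by learned models and the feedback by external reflections; the sketch only illustrates the control flow.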
