PEAM: Parametric Embodied Agent Memory via Contrastive Experience Internalization in Minecraft

Abstract

We present PEAM (Parametric Embodied Agent Memory), a framework for Minecraft agents that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow, deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with physically isolated per-category adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score that decides which experiences to internalize and a scale-free, self-triggered consolidation mechanism that decides when to internalize them. The trigger needs no task-specific hand-tuned thresholds and transfers across task distributions without re-tuning, which makes the agent self-evolving. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting of previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.
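
As a rough illustration of what a joint behavioral-cloning and contrastive objective over a failure-correction pair could look like, the sketch below scores a single decision step. The abstract does not specify the loss, so everything here is an assumption made for illustration: the hinge-style contrast on log-probabilities, the function names `log_softmax` and `joint_loss`, and the `margin` and `weight` parameters are hypothetical and are not taken from PEAM.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def joint_loss(logits, corrected_action, failed_action, margin=1.0, weight=0.5):
    """Hypothetical joint objective for one failure-correction pair.

    Behavioral cloning: negative log-likelihood of the corrected action.
    Contrast (assumed form): hinge penalty unless the corrected action's
    log-probability exceeds the failed action's by at least `margin`.
    """
    logp = log_softmax(logits)
    bc_term = -logp[corrected_action]
    gap = logp[corrected_action] - logp[failed_action]
    contrast_term = max(0.0, margin - gap)
    return bc_term + weight * contrast_term


# A policy that prefers the corrected action (index 0) is penalized less
# than one that still prefers the failed action (index 1).
good = joint_loss([2.0, 0.0, 0.0], corrected_action=0, failed_action=1)
bad = joint_loss([0.0, 2.0, 0.0], corrected_action=0, failed_action=1)
```

In this sketch the contrast term vanishes once the corrected action is preferred over the failed one by the margin, leaving plain behavioral cloning; other contrastive formulations (for example, InfoNCE-style losses over trajectory embeddings) would fit the same description equally well.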