RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy

Abstract

Reasoning before acting and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet prior work either incorporates only one of these abilities into an end-to-end agent or integrates multiple specialized models into an agent system, which limits the learning efficiency and generalization of the policy. This paper therefore makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG end to end, we construct a data pipeline that progressively integrates and enriches the imagination and reasoning content in trajectories collected from existing agents. Jointly learning reasoning and next-image generation explicitly models the inherent correlation between reasoning, action, and environment dynamics, and thus yields a more than 17× improvement in sample efficiency, along with improved generalization, compared with previous works. During inference, RIG first reasons about the next action, produces a potential action, and then predicts the outcome of that action, which gives the agent a chance to review and self-correct based on the imagined outcome before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of the generalist policy but also enables test-time scaling to enhance overall performance.
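The inference procedure described in the abstract (reason, propose an action, imagine its outcome, then review before acting) could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `act_with_review`, `reason_and_propose`, `imagine`, `accept`, and `max_reviews` are all hypothetical, strings stand in for images, and the split into three callables is for readability only, since the abstract describes a single end-to-end policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    reasoning: str  # textual reasoning produced before the action
    action: str     # draft action proposed after reasoning
    imagined: str   # stand-in for the predicted next image


def act_with_review(
    observation: str,
    reason_and_propose: Callable[[str, List[Step]], Tuple[str, str]],
    imagine: Callable[[str, str], str],
    accept: Callable[[str], bool],
    max_reviews: int = 3,
) -> Tuple[str, List[Step]]:
    """Return the action to execute and the trace of reviewed drafts.

    All callables are hypothetical placeholders for what a single
    end-to-end policy would generate.
    """
    if max_reviews < 1:
        raise ValueError("max_reviews must be at least 1")
    trace: List[Step] = []
    for _ in range(max_reviews):
        # 1) Reason, then propose a draft action (earlier drafts are visible).
        reasoning, action = reason_and_propose(observation, trace)
        # 2) Imagine the outcome of the draft action before executing it.
        imagined = imagine(observation, action)
        trace.append(Step(reasoning, action, imagined))
        # 3) Review: keep the draft if the imagined outcome looks acceptable,
        #    otherwise loop and self-correct.
        if accept(imagined):
            break
    # If the budget runs out, fall back to the last draft.
    return trace[-1].action, trace


# Toy stand-ins, purely for illustration.
def toy_propose(obs: str, trace: List[Step]) -> Tuple[str, str]:
    tried = {step.action for step in trace}
    for candidate in ("forward", "turn_left", "turn_right"):
        if candidate not in tried:
            return f"try {candidate}; already rejected: {sorted(tried)}", candidate
    return "no untried action left", "noop"


def toy_imagine(obs: str, action: str) -> str:
    return "wall" if action == "forward" else "open path"


def toy_accept(imagined: str) -> bool:
    return imagined != "wall"


action, trace = act_with_review("facing a wall", toy_propose, toy_imagine, toy_accept)
print(action, len(trace))
```

In this sketch, `max_reviews` is the knob that trades extra inference compute for more chances to self-correct, which is one possible reading of the test-time scaling mentioned in the abstract rather than a description of the paper's actual mechanism.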