Learning POMDP World Models from Observations with Language-Model Priors

Abstract

Whether navigating a building, operating a robot, or playing a game, an agent that acts effectively in an environment must first learn an internal model of how that environment works. Partially observable Markov decision processes (POMDPs) provide a flexible modeling class for such internal world models, but learning them from observation-action trajectories alone is challenging and typically requires extensive environment interaction. We ask whether language-model priors can supply the prior knowledge needed to reduce this costly interaction, and introduce Pinductor (POMDP-inductor), in which a large language model (LLM) proposes candidate POMDP models from a few observation-action trajectories and iteratively refines them to optimize a belief-based likelihood score. Despite using strictly less information, Pinductor matches the performance and sample efficiency of LLM-based POMDP learning methods that assume privileged access to the hidden state, while significantly surpassing the sample efficiency of tabular POMDP baselines. Further results show that performance scales with LLM capability and degrades gracefully as semantic information about the environment is withheld. Together, these results position language-model priors as a practical tool for sample-efficient world-model learning under partial observability, and a step toward generalist agents in real-world environments. Code is available at https://github.com/atomresearch/pinductor.
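The abstract does not spell out the belief-based likelihood score, so the following is only a minimal sketch of one plausible reading: the log-likelihood of an observation sequence under a candidate model, computed with the standard belief (forward) filter. It assumes a small discrete POMDP given as nested lists; the function names `belief_log_likelihood` and `pick_best_candidate`, the table layout, and the two toy candidates are hypothetical and are not taken from the Pinductor code.

```python
import math


def belief_log_likelihood(T, O, b0, actions, observations):
    """Log P(o_1..o_n | a_1..a_n) under a tabular POMDP (assumed layout).

    T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a), b0[s] = initial belief.
    No hidden-state labels are needed: the state is marginalized via the belief.
    """
    belief = list(b0)
    n_states = len(belief)
    log_lik = 0.0
    for a, o in zip(actions, observations):
        # Prediction step: push the current belief through the transition model.
        predicted = [
            sum(belief[s] * T[a][s][s2] for s in range(n_states))
            for s2 in range(n_states)
        ]
        # Correction step: weight each next state by how well it explains o.
        joint = [predicted[s2] * O[a][s2][o] for s2 in range(n_states)]
        evidence = sum(joint)  # P(o_t | history) under this candidate
        if evidence <= 0.0:
            return float("-inf")  # the candidate rules out the observed data
        log_lik += math.log(evidence)
        belief = [p / evidence for p in joint]
    return log_lik


def pick_best_candidate(candidates, b0, actions, observations):
    """Return the name of the candidate (T, O) pair with the highest score."""
    return max(
        candidates,
        key=lambda name: belief_log_likelihood(
            candidates[name][0], candidates[name][1], b0, actions, observations
        ),
    )


# Toy example: two hidden states, one action, two observation symbols.
STAY = [[[1.0, 0.0], [0.0, 1.0]]]          # the state never changes
INFORMATIVE = [[[0.85, 0.15], [0.15, 0.85]]]  # observation mostly reveals the state
UNINFORMATIVE = [[[0.5, 0.5], [0.5, 0.5]]]    # observation carries no information

CANDIDATES = {
    "informative": (STAY, INFORMATIVE),
    "uninformative": (STAY, UNINFORMATIVE),
}
BEST = pick_best_candidate(CANDIDATES, [0.5, 0.5], [0, 0], [0, 0])
```

In a propose-and-refine loop of the kind the abstract describes, a score like this could rank candidate models and guide the next round of proposals; here the candidates are fixed by hand, and no LLM is involved.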