Policy and World Modeling Co-Training for Language Agents

Abstract

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
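
The core observation, that every rollout transition already pairs an action with its resulting next observation, can be illustrated with a minimal sketch. The abstract does not specify how PaW implements this, so everything here is an assumption made for illustration: the `Transition` type, the prompt template, the function names, and the fixed `wm_weight`. In particular, the fixed weight is only a placeholder for the reward-adaptive loss balancing, and the action-entropy-based data selection and noise-tolerant loss are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    """One step of an on-policy rollout (hypothetical container)."""
    observation: str
    action: str
    next_observation: str


def wm_examples(rollout: List[Transition]) -> List[Tuple[str, str]]:
    """Turn each transition into a (prompt, target) pair.

    The target is the next observation, so the same policy model can be
    trained to predict what its own action does to the environment.
    The prompt template is an assumption, not the paper's format.
    """
    examples = []
    for t in rollout:
        prompt = (
            f"Observation: {t.observation}\n"
            f"Action: {t.action}\n"
            "Next observation:"
        )
        examples.append((prompt, t.next_observation))
    return examples


def combined_loss(rl_loss: float, wm_loss: float, wm_weight: float) -> float:
    """Add the auxiliary WM loss to the RL loss.

    A fixed scalar weight is used here as a stand-in; the paper's
    reward-adaptive balancing rule is not given in the abstract.
    """
    return rl_loss + wm_weight * wm_loss


# Usage: two transitions from a toy rollout.
rollout = [
    Transition("door closed", "open door", "door open"),
    Transition("door open", "walk through", "in hallway"),
]
pairs = wm_examples(rollout)
total = combined_loss(rl_loss=1.0, wm_loss=2.0, wm_weight=0.5)
```

Because the pairs are read directly from rollouts that RL already collects, this kind of supervision needs no separate simulator and leaves inference unchanged, which is the property the abstract emphasizes.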