

RLVR-World: Training World Models with Reinforcement Learning

May 20, 2025
Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
cs.AI

Abstract

World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.
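The pipeline described in the abstract, sampling a tokenized next-state prediction, decoding it, and scoring the decoded output with a task metric that serves as the verifiable reward, can be illustrated with a minimal sketch. All names below (`world_model`, `tokenizer`, `metric_fn`) are hypothetical placeholders under assumed interfaces, not the paper's released code or exact training algorithm.

```python
# Minimal sketch of an RLVR-style update for a tokenized world model.
# Hypothetical interfaces: world_model.sample(), tokenizer.encode()/decode(),
# and metric_fn() stand in for whatever the actual implementation uses.
import torch


def rlvr_step(world_model, tokenizer, optimizer, state, action,
              next_state, metric_fn, num_samples=4):
    """One policy-gradient step using a prediction metric as the verifiable reward."""
    prompt = tokenizer.encode(state, action)                # condition on (state, action)
    rewards, logps = [], []
    for _ in range(num_samples):
        # Autoregressively sample a tokenized next-state prediction and keep its log-prob.
        pred_tokens, logp = world_model.sample(prompt, return_logprob=True)
        pred_state = tokenizer.decode(pred_tokens)           # decode back to raw modality (text/video)
        rewards.append(metric_fn(pred_state, next_state))    # e.g., accuracy or perceptual quality
        logps.append(logp)
    rewards = torch.tensor(rewards)
    advantages = rewards - rewards.mean()                    # group-relative baseline over the samples
    loss = -(advantages * torch.stack(logps)).mean()         # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```

The key contrast with MLE training is that the reward is computed on the decoded prediction in the target modality, so the optimization signal matches the evaluation metric rather than token-level likelihood.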