RLVR-World: Training World Models with Reinforcement Learning
May 20, 2025
Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
cs.AI
Abstract
World models predict state transitions in response to actions and are
increasingly developed across diverse modalities. However, standard training
objectives such as maximum likelihood estimation (MLE) often misalign with
task-specific goals of world models, i.e., transition prediction metrics like
accuracy or perceptual quality. In this paper, we present RLVR-World, a unified
framework that leverages reinforcement learning with verifiable rewards (RLVR)
to directly optimize world models for such metrics. Although world modeling is
formulated as autoregressive prediction of tokenized sequences, RLVR-World
evaluates metrics computed on the decoded predictions as verifiable rewards. We demonstrate
substantial performance gains on both language- and video-based world models
across domains, including text games, web navigation, and robot manipulation.
Our work indicates that, beyond recent advances in reasoning language models,
RLVR offers a promising post-training paradigm for enhancing the utility of
generative models more broadly.
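The abstract describes computing verifiable rewards by decoding the world model's predicted token sequence and scoring it directly with a task metric such as transition-prediction accuracy. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the function names, the exact-match and token-accuracy metrics, and the toy decoder are all assumptions made for illustration.

```python
# Illustrative sketch (assumptions, not the authors' code): turn a
# task metric on decoded predictions into a scalar reward that an
# RL post-training loop could optimize.
from typing import Callable, List


def exact_match_reward(pred_tokens: List[int],
                       target_tokens: List[int],
                       decode: Callable[[List[int]], str]) -> float:
    """Verifiable reward: 1.0 if the decoded predicted transition
    exactly matches the decoded ground-truth transition, else 0.0."""
    return float(decode(pred_tokens) == decode(target_tokens))


def token_accuracy_reward(pred_tokens: List[int],
                          target_tokens: List[int]) -> float:
    """Denser variant: fraction of predicted tokens matching the
    target, a rough proxy for per-step prediction accuracy."""
    if not pred_tokens or not target_tokens:
        return 0.0
    correct = sum(p == t for p, t in zip(pred_tokens, target_tokens))
    return correct / max(len(pred_tokens), len(target_tokens))


if __name__ == "__main__":
    # Toy usage: a trivial "decoder" that joins token ids into a string.
    decode = lambda toks: " ".join(map(str, toks))
    pred, target = [3, 7, 7, 2], [3, 7, 5, 2]
    print(exact_match_reward(pred, target, decode))   # 0.0
    print(token_accuracy_reward(pred, target))        # 0.75
```

In a full RLVR-style pipeline, such rewards would presumably feed a policy-gradient-style update (e.g., PPO or GRPO) applied to the autoregressive world model, replacing or complementing the MLE objective the abstract says is misaligned with these metrics.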