Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

December 11, 2025
作者: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
cs.AI

Abstract

Reinforcement learning (RL), previously proven effective for large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward design and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward design: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial and that general multi-modal models provide robust signals for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate how performance scales with training data and iterations. (3) Text-to-3D benchmarks: Since existing benchmarks fail to measure the implicit reasoning abilities of 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D generation model, optimized end to end from coarse shape generation to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
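To make the "token-level optimization" point concrete: in GRPO-style training, a group of sequences is sampled per prompt, each sequence's scalar reward is normalized against the group, and the resulting advantage is applied to every token; a token-level variant then averages the clipped loss over all tokens in the group rather than per sequence. The paper's exact objective is not reproduced in this abstract, so the snippet below is only a minimal sketch of that general recipe, with hypothetical names and shapes (`grpo_token_level_loss`, `clip_eps`), not the authors' implementation.

```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Sketch of a token-level GRPO objective (hypothetical, not the paper's code).

    logprobs, old_logprobs: (G, T) per-token log-probs for G sampled sequences
    rewards: (G,) scalar reward per sequence, e.g. from a reward ensemble
    mask: (G, T) 1.0 for valid tokens, 0.0 for padding
    """
    # Group-relative advantage: normalize each reward within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    adv = adv[:, None].expand_as(logprobs)  # same advantage for every token

    # PPO-style clipped surrogate, computed per token.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    per_token = -torch.minimum(ratio * adv, clipped * adv)

    # Token-level aggregation: one average over all valid tokens in the group,
    # instead of averaging within each sequence first.
    return (per_token * mask).sum() / mask.sum()
```

The aggregation step is what "token-level" usually refers to: averaging over all tokens at once gives longer sequences proportionally more weight in the gradient, whereas per-sequence averaging weights every sampled object equally regardless of its token length.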