Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
December 11, 2025
Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao
cs.AI
Abstract
Reinforcement learning (RL), first proven effective in large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward design and the choice of RL algorithm. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward design: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial and that general multi-modal models provide robust signals for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate how performance scales with training data and iterations. (3) Text-to-3D benchmarks: Since existing benchmarks fail to measure the implicit reasoning abilities of 3D generation models, we introduce the new MME-3DR benchmark. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, which is expert at generation from coarse shapes to refined textures. We hope this study offers insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
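For readers unfamiliar with the token-level GRPO variant the abstract highlights, the following is a minimal sketch of a group-relative, token-level clipped policy objective with a scalar reward ensemble. It is an illustration only, not the released AR3D-R1 implementation (see the repository for that): the helper name `grpo_token_level_loss`, the tensor shapes, the `clip_eps` value, and the geometry/texture reward weights are all assumptions made for this example.

```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, clip_eps=0.2):
    """Sketch of a token-level GRPO update for one group of samples.

    logprobs, old_logprobs: (G, T) per-token log-probs under the current
    and behavior policies for G sequences of length T from one prompt.
    rewards: (G,) scalar reward per sequence, e.g. an ensemble of
    geometry and texture scores (hypothetical weights in the caller).
    """
    # Group-relative advantage: normalize rewards within the group,
    # then broadcast the sequence-level advantage to every token.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1).expand_as(logprobs)                  # (G, T)

    # Per-token importance ratio with a PPO-style clipped objective.
    ratio = torch.exp(logprobs - old_logprobs)                  # (G, T)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv

    # Averaging over all tokens (rather than per sequence) weights
    # every token equally -- the token-level variant referred to above.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for real model outputs.
G, T = 8, 128
logprobs = torch.randn(G, T)
old_logprobs = logprobs.detach() + 0.01 * torch.randn(G, T)
# Hypothetical reward ensemble: weighted geometry and texture scores.
rewards = 0.6 * torch.rand(G) + 0.4 * torch.rand(G)
loss = grpo_token_level_loss(logprobs, old_logprobs, rewards)
```

A hierarchical scheme in the spirit of Hi-GRPO could apply such an update per stage, scoring coarse-shape tokens with geometry-oriented rewards and refinement tokens with texture-oriented ones; the exact staging and reward models are specified in the paper, not here.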