
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

May 29, 2025
作者: Weijia Mao, Zhenheng Yang, Mike Zheng Shou
cs.AI

Abstract

Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. In addition, several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only a few additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.
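The self-improvement loop described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: all function names (`generate_image`, `understand_image`, `unirl_step`) and the toy reward are hypothetical stand-ins for the unified model's generation and understanding heads, shown only to make the loop structure and the GRPO-style group-relative advantage concrete.

```python
import random

# Hypothetical stubs for the unified model's two heads; a real run would
# call a model such as Show-o or Janus. Here they are toys so the loop runs.
def generate_image(model, prompt):
    # Generation head: prompt -> image (toy: a dict tagging the prompt).
    return {"prompt": prompt, "noise": random.random()}

def understand_image(model, image):
    # Understanding head: image -> predicted description (toy: perfect recovery).
    return image["prompt"]

def reward(prompt, prediction):
    # Consistency reward: does the understanding head recover the prompt?
    return 1.0 if prediction == prompt else 0.0

def unirl_step(model, prompts, group_size=4):
    """One self-improvement iteration: sample a group of images per prompt,
    score each with the understanding head, and compute GRPO-style advantages
    (reward minus the group mean) to supervise the next generation update.
    No external image data is used -- every sample comes from the model."""
    batch = []
    for prompt in prompts:
        group = [generate_image(model, prompt) for _ in range(group_size)]
        rewards = [reward(prompt, understand_image(model, img)) for img in group]
        baseline = sum(rewards) / len(rewards)  # group-relative baseline
        for img, r in zip(group, rewards):
            batch.append((img, r - baseline))   # advantage for the policy update
    return batch
```

In this sketch the two tasks supervise each other exactly as the abstract describes: generation produces the training images, and the understanding head's judgments define the reward that updates generation.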

