UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Abstract

Unified multimodal large language models such as Show-o and Janus have achieved strong performance across both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. Several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance, but also reduces the imbalance between generation and understanding; and (3) it requires only a few additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.
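As a rough illustration of the GRPO component mentioned above, the sketch below computes group-relative advantages: several samples are drawn for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for illustration; the abstract does not specify how UniRL derives rewards from the understanding results.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    (group standard deviation + eps), so samples that beat the group average
    get a positive advantage and the rest get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images generated from one prompt, e.g.
# 1.0 if the model's own understanding pass judged the image consistent
# with the prompt and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed; a group in which every sample receives the same reward yields zero advantages and therefore contributes no learning signal.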
