UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning

Abstract

Unified multimodal large language models such as Show-o and Janus have achieved strong performance on both generation and understanding tasks. However, these models typically rely on large-scale datasets and require substantial computation during the pretraining stage. Several post-training methods have been proposed, but they often depend on external data or are limited to task-specific customization. In this work, we introduce UniRL, a self-improving post-training approach. Our approach enables the model to generate images from prompts and use them as training data in each iteration, without relying on any external image data. Moreover, it enables the two tasks to enhance each other: the generated images are used for understanding, and the understanding results are used to supervise generation. We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) to optimize the models. UniRL offers three key advantages: (1) it requires no external image data, as all training samples are generated by the model itself during training; (2) it not only improves individual task performance but also reduces the imbalance between generation and understanding; and (3) it requires only a few additional training steps during the post-training stage. We evaluate UniRL on top of Show-o and Janus, achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and models will be released at https://github.com/showlab/UniRL.
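As a rough illustration of the "group relative" part of GRPO named in the abstract, the sketch below normalizes rewards within a group of samples drawn for the same prompt, following the commonly used mean and standard deviation formulation. The abstract does not say how UniRL computes its rewards, so the function name `group_relative_advantages`, the reward values, the `eps` constant, and the choice of population standard deviation are all assumptions made for this example, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their own group's average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images generated from a single prompt,
# e.g. how well each image matched the prompt (values are made up).
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Because the normalization is done per group, no separately learned value function is needed to obtain a baseline; the other samples for the same prompt play that role.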
