VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

Abstract

In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds on our previous framework, VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy that combines iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, with significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs) and exhibit promising scalability. The codebase and model weights are publicly available at https://github.com/VARGPT-family/VARGPT-v1.1.
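The abstract names Direct Preference Optimization (DPO) as the reinforcement learning component but does not describe how it is applied to image generation. As a rough illustration only, the sketch below computes the standard per-pair DPO loss as it is usually stated in the LLM literature. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example and are not taken from the VARGPT-v1.1 codebase.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss from summed sequence log-probabilities.

    All names and the default beta are illustrative assumptions,
    not the authors' implementation.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference logit: positive when the policy favors the
    # chosen response more strongly than the reference does.
    z = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed stably.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

Under this formulation, a policy identical to the reference yields a loss of log 2, and the loss falls as the policy shifts probability toward the preferred sample relative to the reference.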

