ARM: An Autoregressive Large Multimodal Model with Unified Discrete Representations

Abstract

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a single next-token prediction framework. ARM is built on three efforts. First, we train a discrete semantic visual tokenizer that maps images into compact token sequences. The tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment, and faithful reconstruction, so that diverse tasks can be supported in a shared latent space. Second, using this tokenizer, we train a 7B autoregressive model over large-scale text and image token sequences, developing vision-language perception and generation capabilities within one model. Finally, to further improve preference-aligned behavior in text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising the WISE overall score from 0.50 to 0.56 and GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.
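
To make the tokenize-then-predict idea concrete, the following is a minimal illustrative sketch and not ARM's actual implementation. It assumes a nearest-neighbor codebook lookup as the quantizer and a simple ID offset to place image tokens in the same vocabulary as text tokens; the abstract does not specify either mechanism. All names (`nearest_code`, `tokenize_image`, `build_sequence`, `next_token_pairs`, `TEXT_VOCAB_SIZE`) and all numeric values are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical size of the text vocabulary; image token IDs are placed after it.
TEXT_VOCAB_SIZE = 8


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize_image(patch_features: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous patch feature to a discrete code index."""
    return [nearest_code(vec, codebook) for vec in patch_features]


def build_sequence(text_ids: Sequence[int], image_ids: Sequence[int],
                   text_vocab_size: int = TEXT_VOCAB_SIZE) -> List[int]:
    """Concatenate text and image tokens, offsetting image IDs to avoid collisions."""
    return list(text_ids) + [text_vocab_size + i for i in image_ids]


def next_token_pairs(sequence: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Enumerate (prefix, target) pairs used by a next-token prediction objective."""
    return [(list(sequence[:t]), sequence[t]) for t in range(1, len(sequence))]


# Toy data (assumed): a 4-entry codebook and three 2-D patch features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]

image_ids = tokenize_image(patches, codebook)   # [0, 3, 2]
sequence = build_sequence([3, 5], image_ids)    # [3, 5, 8, 11, 10]
pairs = next_token_pairs(sequence)              # 4 (prefix, target) pairs
```

Under these assumptions, text and image tokens share one sequence, so a single autoregressive model can be trained on both with the same prediction target. A real tokenizer would learn its codebook and encoder jointly under the objectives named in the abstract, which this sketch omits.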