ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations
June 9, 2026
Authors: Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang
cs.AI
Abstract
This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token sequences. Our tokenizer is supervised with multiple objectives that jointly promote semantic discriminability, language alignment and faithful reconstruction, thereby supporting diverse tasks in a shared latent space. With this, we train a 7B autoregressive model over large-scale text and image token sequences, seamlessly developing vision-language perception and generation capabilities. Finally, to further improve preference-aligned behavior for text-to-image generation and instruction-guided editing, ARM applies reinforcement learning (RL) to optimize task-level objectives such as visual quality, instruction adherence, and edit consistency. Surprisingly, the results show that RL not only substantially improves performance on the target tasks (e.g., raising WISE overall from 0.50 to 0.56, GEdit-Bench-EN G_O from 5.75 to 6.68), but also induces cross-task synergy between text-to-image generation and editing. Collectively, these findings highlight autoregressive modeling, when paired with strong representations and preference optimization, as a scalable foundation for multimodal intelligence. Code: https://github.com/wdrink/ARM.
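To make the idea of unifying text and images in one next-token prediction framework concrete, here is a minimal sketch of how discrete image codes might share a single vocabulary with text tokens. It is an illustration only and is not taken from the ARM code base: the vocabulary sizes, the `BOI`/`EOI` marker tokens, and the helper names `build_sequence` and `next_token_pairs` are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical sizes, chosen only for illustration (not ARM's real values).
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK_SIZE = 256

# Hypothetical marker tokens placed after the text and image ID ranges.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image
EOI = BOI + 1                                # end-of-image


def build_sequence(text_ids: Sequence[int], image_codes: Sequence[int]) -> List[int]:
    """Concatenate text token IDs and image codebook indices into one sequence.

    Image codes are shifted past the text ID range so that both modalities
    live in a single shared vocabulary.
    """
    for t in text_ids:
        if not 0 <= t < TEXT_VOCAB_SIZE:
            raise ValueError(f"text id out of range: {t}")
    for c in image_codes:
        if not 0 <= c < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code out of range: {c}")
    shifted = [TEXT_VOCAB_SIZE + c for c in image_codes]
    return list(text_ids) + [BOI] + shifted + [EOI]


def next_token_pairs(seq: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) where targets[i] is the token after inputs[i]."""
    if len(seq) < 2:
        raise ValueError("need at least two tokens")
    return list(seq[:-1]), list(seq[1:])


# Example: a two-token prompt followed by a two-code image.
sequence = build_sequence([5, 7], [0, 255])
inputs, targets = next_token_pairs(sequence)
```

Under this layout, text-to-image generation and image understanding differ only in which modality comes first in the sequence; the training objective stays the same shifted-by-one prediction in both cases.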