

AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation

January 25, 2026
Authors: Dongjie Cheng, Ruifeng Yuan, Yongqi Li, Runyang You, Wenjie Wang, Liqiang Nie, Lei Zhang, Wenjie Li
cs.AI

Abstract

Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of "Omni" MLLMs that support both multimodal inputs and multimodal outputs. While a number of omni MLLMs have emerged, most existing systems still rely on additional expert components to achieve multimodal generation, limiting the simplicity of unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, has proven an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three practical issues in unified AR modeling: modality imbalance, via task-aware loss reweighting; visual fidelity, via a lightweight token-level perceptual alignment loss on image tokens; and the stability-creativity trade-off, via a finite-state decoding mechanism. Empirically, AR-Omni achieves strong quality across all three modalities while remaining real-time, with a real-time factor of 0.88 for speech generation.
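The abstract does not spell out the exact form of the task-aware loss reweighting. As a rough illustration of the general idea only (a minimal sketch, assuming per-task scalar weights applied to a shared next-token cross-entropy; the function name, weight values, and tensor layout below are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Hypothetical per-task weights: the abstract only says modality imbalance is
# handled "via task-aware loss reweighting", not what the weights or schedule
# are. These values are purely illustrative.
TASK_WEIGHTS = {"text": 1.0, "image": 0.5, "speech": 0.5}


def reweighted_next_token_loss(logits, targets, task_names, task_weights=TASK_WEIGHTS):
    """Unified next-token cross-entropy, rescaled per task.

    logits:     (batch, seq_len, vocab) predictions over the shared vocabulary
    targets:    (batch, seq_len) next-token labels; -100 marks ignored positions
    task_names: list of length batch, e.g. "text" | "image" | "speech"
    """
    # Per-token cross-entropy; ignored positions contribute zero loss.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=-100, reduction="none"
    )  # (batch, seq_len)
    mask = (targets != -100).float()
    # Average over valid positions within each sequence.
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    # Scale each sequence's loss by its task weight before averaging the batch.
    weights = torch.tensor(
        [task_weights[name] for name in task_names], device=per_seq.device
    )
    return (weights * per_seq).mean()
```

The point of the sketch is that all three modalities share one token stream and one objective, so balancing them reduces to scaling contributions within a single loss rather than maintaining separate expert heads.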