Toward Native Multimodal Modeling: A Roadmap

Abstract

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. Early approaches predominantly rely on late fusion, which assembles encoders, frozen language backbones, and output heads; recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically for superior multimodal performance. Despite this potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text, for cross-modal comprehension with text-only output; (ii) Multi-to-Target, for scenario-oriented generation such as image, audio, and video generation; and (iii) Multi-to-Multi, for unified modeling with symmetric input and output. We deliver a comprehensive, industrial-grade investigation into the transition toward the definitive NMM framework, in which understanding and generation coexist seamlessly within a unified transformer paradigm. From an industrial perspective, we systematically unpack the end-to-end pipeline: architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.