Toward Native Multimodal Modeling: A Roadmap

Abstract

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. Early approaches predominantly rely on late fusion, which assembles encoders and frozen language backbones with output heads. Recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically to achieve superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text, for cross-modal comprehension with text-only output; (ii) Multi-to-Target, for scenario-oriented generation such as image, audio, and video generation; and (iii) Multi-to-Multi, for unified modeling with symmetric input and output. We deliver a comprehensive, industrial-grade investigation into the transition toward the definitive NMM framework, in which understanding and generation coexist seamlessly within a unified transformer paradigm. From an industrial perspective, we systematically unpack the end-to-end pipeline, covering architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.