Toward Native Multimodal Modeling: A Roadmap

Abstract

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. Early approaches predominantly relied on late fusion, which assembles encoders, frozen language backbones, and output heads; recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio, and video generation; and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive, industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. From an industrial perspective, we systematically unpack the end-to-end pipeline: architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.
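
As an informal illustration of the three input-output categories named in the abstract, the following Python sketch assigns a model to a category from its sets of input and output modalities. The decision rules (text-only output, output set equal to input set) are our own simplifying assumptions for illustration, not the paper's formal definitions, and the function and enum names are hypothetical.

```python
from enum import Enum


class Category(Enum):
    MULTI_TO_TEXT = "Multi-to-Text"
    MULTI_TO_TARGET = "Multi-to-Target"
    MULTI_TO_MULTI = "Multi-to-Multi"


def categorize(inputs, outputs):
    """Assign an input-output category to a model (illustrative rules only).

    Assumed rules, not taken from the paper:
      * output is exactly {"text"}              -> Multi-to-Text
      * output set equals input set (size > 1)  -> Multi-to-Multi
      * anything else (e.g. image-only output)  -> Multi-to-Target
    """
    inputs, outputs = set(inputs), set(outputs)
    if not inputs or not outputs:
        raise ValueError("a model needs at least one input and one output modality")
    if outputs == {"text"}:
        return Category.MULTI_TO_TEXT
    if outputs == inputs and len(outputs) > 1:
        return Category.MULTI_TO_MULTI
    return Category.MULTI_TO_TARGET


if __name__ == "__main__":
    # Hypothetical model descriptions, used only to exercise the rules.
    print(categorize({"text", "image"}, {"text"}).value)
    print(categorize({"text", "image"}, {"image"}).value)
    print(categorize({"text", "image", "audio"}, {"text", "image", "audio"}).value)
```

Under these assumed rules, a model that reads text and images but emits only text falls under Multi-to-Text, while one whose outputs mirror its inputs falls under Multi-to-Multi; borderline cases, such as outputs that only partly overlap the inputs, would need the paper's formal definitions to resolve.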