Beyond Language Modeling: An Exploration of Multimodal Pretraining

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
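
The IsoFLOP analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes the common approximation that training compute is about 6 x parameters x tokens, which the abstract does not state, and it runs on synthetic loss curves generated by a hypothetical `synthetic_curve` helper, since no measurements are given here. For each compute budget, a parabola is fitted to loss versus log parameter count, its minimum is taken as the compute-optimal model size, and power laws are then fitted across budgets.

```python
import numpy as np


def isoflop_optimum(params, losses):
    """Fit a parabola to loss vs. log10(params); return the loss-minimizing size."""
    x = np.log10(np.asarray(params, dtype=float))
    a, b, _ = np.polyfit(x, np.asarray(losses, dtype=float), 2)
    if a <= 0:
        raise ValueError("curve has no interior minimum")
    return 10 ** (-b / (2 * a))


def fit_power_law(compute, optima):
    """Fit optima = k * compute**exponent in log-log space; return (exponent, k)."""
    slope, intercept = np.polyfit(np.log10(compute), np.log10(optima), 1)
    return slope, 10 ** intercept


def synthetic_curve(compute, exponent):
    """Hypothetical data: a quadratic loss bowl centered at compute**exponent."""
    n_opt = compute ** exponent
    params = n_opt * np.logspace(-1, 1, 7)  # seven model sizes around the optimum
    losses = 2.0 + 0.1 * (np.log10(params) - np.log10(n_opt)) ** 2
    return params, losses


# Illustrative compute budgets in FLOPs (assumed values, not from the paper).
budgets = np.array([1e18, 1e19, 1e20])

# Compute-optimal parameter count per budget, using a made-up exponent of 0.5.
n_opts = np.array([isoflop_optimum(*synthetic_curve(c, 0.5)) for c in budgets])

# Assumption: compute ~= 6 * params * tokens, so optimal tokens follow from n_opts.
d_opts = budgets / (6 * n_opts)

param_exp, _ = fit_power_law(budgets, n_opts)
data_exp, _ = fit_power_law(budgets, d_opts)
print(f"parameter exponent: {param_exp:.3f}, data exponent: {data_exp:.3f}")
```

Under this framing, a modality whose fitted data exponent is larger than its parameter exponent would count as more data-hungry, which is the kind of asymmetry the abstract reports for vision relative to language. The exponents recovered here only reflect the synthetic input.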
