Beyond Language Modeling: An Exploration of Multimodal Pretraining
March 3, 2026
Authors: Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
cs.AI
Abstract
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) a Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) a Mixture-of-Experts (MoE) architecture enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
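The combined training setup described above (next-token prediction for text, diffusion for vision) can be sketched as a single objective. The formulation below is a minimal illustration in the style of Transfusion, not the paper's exact loss: the balancing coefficient λ and the epsilon-prediction parameterization of the diffusion term are assumptions.

```latex
% Sketch of a Transfusion-style combined objective (lambda and the epsilon
% parameterization are illustrative assumptions, not the paper's exact choice).
\mathcal{L}
  = \underbrace{\mathbb{E}_{x}\Big[-\textstyle\sum_{t} \log p_\theta\big(x_t \mid x_{<t}\big)\Big]}_{\text{next-token prediction on text tokens}}
  \;+\; \lambda\,
    \underbrace{\mathbb{E}_{z_0,\,\epsilon,\,\tau}\Big[\big\|\epsilon - \epsilon_\theta\big(z_\tau, \tau, c\big)\big\|^2\Big]}_{\text{denoising diffusion on visual latents}}
```

Here \(z_0\) denotes the visual latents (e.g., RAE encodings), \(z_\tau\) their noised version at diffusion timestep \(\tau\), and \(c\) the surrounding text/context conditioning.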
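The IsoFLOP analysis mentioned in the abstract can likewise be illustrated with a small sketch: for a fixed compute budget, sweep model size, derive the token budget from the budget, and fit a parabola in log model size to the measured losses to read off the compute-optimal point. The C ≈ 6ND approximation, the budgets, and the loss values below are placeholders for illustration and are not taken from the paper.

```python
# Illustrative IsoFLOP sweep (not the paper's code): for a fixed compute budget
# C ~= 6 * N * D, vary model size N, derive the token budget D = C / (6N), and
# fit a parabola in log N to the measured final losses to locate the
# compute-optimal model size N*.
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Fit loss = a*(log N)^2 + b*log N + c and return the argmin N*."""
    log_n = np.log(np.asarray(model_sizes, dtype=float))
    a, b, c = np.polyfit(log_n, np.asarray(losses, dtype=float), deg=2)
    return float(np.exp(-b / (2.0 * a)))  # vertex of the fitted parabola

compute_budget = 1e20                                  # total FLOPs on this curve
model_sizes = np.array([1e8, 3e8, 1e9, 3e9, 1e10])     # parameter counts N
token_budgets = compute_budget / (6.0 * model_sizes)   # tokens D per run

# Placeholder end-of-run losses (hypothetical numbers, not measured results).
losses = np.array([3.10, 2.85, 2.72, 2.78, 2.95])

n_star = isoflop_optimum(model_sizes, losses)
print(f"compute-optimal model size ~ {n_star:.3e} parameters")
print(f"corresponding token budget  ~ {compute_budget / (6.0 * n_star):.3e} tokens")
```

Running the same fit separately on language-only and vision-only (or vision-heavy) training mixes is one way a scaling asymmetry such as the one reported would surface, as different compute-optimal parameter/token ratios for the two modalities.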