The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

Abstract

This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. Unlike existing modular MLLMs, which rely on a pretrained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, yielding a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of the visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.
