4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Abstract

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand their capabilities by training a single model on tens of highly diverse modalities and by co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo labels from specialist models like SAM and 4DHumans, and a range of new modalities, for example image metadata or color palettes, that allow for novel ways to interact with the model and steer the generation. A crucial step in this process is performing discrete tokenization on the various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand the out-of-the-box capabilities of multimodal models and specifically show that it is possible to train one model to solve at least 3x more tasks/modalities than existing ones without a loss in performance. This enables more fine-grained and controllable multimodal generation and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three-billion-parameter model using tens of modalities and different datasets. The resulting models and training code are open-sourced at 4m.epfl.ch.
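To make the idea of discrete tokenization concrete, the following is a minimal sketch of nearest-neighbor vector quantization, one common way to map continuous vectors (such as feature-map entries) to discrete token ids. It is an illustration only and not the tokenizer used for 4M-21: the abstract does not give tokenizer details, and the codebook values and the function names `quantize` and `dequantize` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each vector to the index (token id) of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(v, codebook[i]),
        )
        tokens.append(nearest)
    return tokens


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Recover an approximation of the input by looking up each token id."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 2-D codebook with four entries (assumed values, for illustration).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Toy "feature vectors" standing in for entries of a continuous modality.
features = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]

tokens = quantize(features, CODEBOOK)
print(tokens)  # -> [0, 1, 2, 3]
print(dequantize(tokens, CODEBOOK))
```

Once every modality is expressed as a sequence of such token ids, a single sequence model can in principle consume and predict any of them, which is the property the abstract relies on.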
