4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
June 13, 2024
Authors: Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir
cs.AI
Abstract
Current multimodal and multitask foundation models like 4M or UnifiedIO show
promising results, but in practice their out-of-the-box abilities to accept
diverse inputs and perform diverse tasks are limited by the (usually rather
small) number of modalities and tasks they are trained on. In this paper, we
expand their capabilities by training a single model on tens of
highly diverse modalities and by performing co-training on large-scale
multimodal datasets and text corpora. This includes training on several
semantic and geometric modalities, feature maps from recent state-of-the-art
models like DINOv2 and ImageBind, pseudo labels from specialist models like SAM
and 4DHumans, and a range of new modalities that allow for novel ways to
interact with the model and steer the generation, for example image metadata or
color palettes. A crucial step in this process is performing discrete
tokenization on various modalities, whether they are image-like, neural network
feature maps, vectors, structured data like instance segmentation or human
poses, or data that can be represented as text. Through this, we expand on the
out-of-the-box capabilities of multimodal models and specifically show the
possibility of training one model to solve at least 3x more tasks/modalities
than existing ones and doing so without a loss in performance. This enables
more fine-grained and controllable multimodal generation capabilities and
allows us to study the distillation of models trained on diverse data and
objectives into a unified model. We successfully scale the training to a
three-billion-parameter model using tens of modalities and different datasets. The
resulting models and training code are open sourced at 4m.epfl.ch.
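
To make the "discrete tokenization" step mentioned in the abstract more concrete, the sketch below illustrates the general idea of mapping continuous features of any modality to discrete token ids via nearest-neighbour lookup in a codebook. It is a minimal, assumption-laden illustration: the actual 4M-21 tokenizers are learned, modality-specific models, and the random codebook, array shapes, and function names here are purely hypothetical.

```python
# Minimal sketch of discrete tokenization via nearest-neighbour vector
# quantization. Purely illustrative: the codebook is random here, whereas
# real tokenizers learn it (and the encoder/decoder) from data.
import numpy as np

def tokenize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) array, e.g. flattened patch embeddings of some modality.
    codebook: (K, D) array of code vectors.
    Returns an (N,) array of discrete token ids in [0, K).
    """
    # Squared Euclidean distance between every feature and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def detokenize(tokens: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look token ids back up in the codebook (a lossy reconstruction)."""
    return codebook[tokens]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(512, 16))   # K=512 codes, D=16 dims (made up)
    features = rng.normal(size=(196, 16))   # e.g. 14x14 grid of patch features
    tokens = tokenize(features, codebook)
    recon = detokenize(tokens, codebook)
    print(tokens[:8], recon.shape)          # discrete ids, (196, 16)
```

Once every modality is reduced to such token sequences, a single model can be trained to predict any subset of tokens from any other, which is what enables the any-to-any behaviour described in the abstract.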