4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

June 13, 2024
Authors: Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir
cs.AI

Abstract

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon their capabilities by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo-labels from specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones, and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three-billion-parameter model using tens of modalities and different datasets. The resulting models and training code are open-sourced at 4m.epfl.ch.
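
The key technical step described in the abstract is mapping every modality, whether image-like, a neural network feature map, or structured data, into discrete tokens so that a single any-to-any model can consume and predict all of them uniformly. The sketch below is a minimal, hypothetical illustration of that idea using nearest-neighbour vector quantization in PyTorch; the ToyVQTokenizer class, its codebook size, and the per-modality channel counts are assumptions for illustration, not the tokenizers released with 4M-21.

```python
# Illustrative sketch (not the released 4M-21 code): discretize a dense
# modality (an RGB crop, a DINOv2-style feature map, ...) into codebook
# indices via nearest-neighbour vector quantization.
import torch
import torch.nn as nn


class ToyVQTokenizer(nn.Module):
    """Maps (B, C, H, W) inputs to (B, H*W) discrete token ids."""

    def __init__(self, in_channels: int, codebook_size: int = 1024, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(in_channels, dim, kernel_size=1)  # stand-in encoder
        self.codebook = nn.Embedding(codebook_size, dim)           # learned code vectors

    @torch.no_grad()
    def tokenize(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                                  # (B, D, H, W)
        z = z.flatten(2).transpose(1, 2)                     # (B, H*W, D)
        # Distance to every codebook entry, then argmin -> token id
        d = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, H*W, K)
        return d.argmin(dim=-1)                              # (B, H*W)


# Usage: tokenize a 3-channel image-like modality and a 384-dim feature map
rgb_tok = ToyVQTokenizer(in_channels=3)
feat_tok = ToyVQTokenizer(in_channels=384)
rgb_ids = rgb_tok.tokenize(torch.randn(2, 3, 32, 32))      # -> shape (2, 1024)
feat_ids = feat_tok.tokenize(torch.randn(2, 384, 16, 16))  # -> shape (2, 256)
print(rgb_ids.shape, feat_ids.shape)
```

The point of the sketch: once each modality is reduced to sequences of token ids along these lines, a single token-based model can treat every modality, from segmentation maps to color palettes, as just another input or output sequence.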