4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Abstract

Current multimodal and multitask foundation models such as 4M or UnifiedIO show promising results, but in practice their out-of-the-box ability to accept diverse inputs and perform diverse tasks is limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we extend the capabilities of these models by training a single model on tens of highly diverse modalities and by co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models such as DINOv2 and ImageBind, pseudo-labels from specialist models such as SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on the various modalities, whether they are image-like data, neural network feature maps, vectors, structured data such as instance segmentation or human poses, or data that can be represented as text. Through this, we expand the out-of-the-box capabilities of multimodal models and specifically show that one model can be trained to solve at least 3x more tasks/modalities than existing models, without a loss in performance. This enables more fine-grained and controllable multimodal generation and allows us to study the distillation of models trained on diverse data and objectives into a single unified model. We successfully scale the training to a three-billion-parameter model using tens of modalities and different datasets. The resulting models and training code are open-sourced at 4m.epfl.ch.
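
To make the idea of discrete tokenization concrete, the following is a minimal sketch of nearest-neighbor vector quantization, one common way to turn continuous vectors into discrete token ids. It is an illustration under stated assumptions, not the tokenizers used for 4M-21: the function names, the four-entry RGB codebook, and the example palette are all hypothetical, and a real tokenizer would learn its codebook from data rather than fix it by hand.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def detokenize(tokens: List[int], codebook: List[Vector]) -> List[List[float]]:
    """Recover an approximation of the input by looking up each token id."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical hand-written codebook over RGB colors (assumption for illustration).
CODEBOOK = [
    (0.0, 0.0, 0.0),  # token 0: black
    (1.0, 0.0, 0.0),  # token 1: red
    (0.0, 1.0, 0.0),  # token 2: green
    (0.0, 0.0, 1.0),  # token 3: blue
]

# A made-up three-color palette, expressed as continuous RGB vectors.
palette = [(0.9, 0.1, 0.0), (0.1, 0.2, 0.8), (0.05, 0.0, 0.1)]

tokens = tokenize(palette, CODEBOOK)
reconstruction = detokenize(tokens, CODEBOOK)
print(tokens)  # [1, 3, 0]
```

Once every modality is reduced to a sequence of integer ids in this way, a single sequence model can in principle treat all of them uniformly, which is what makes the tokenization step central to training one model across many modalities.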

