Explore the Limits of Omni-modal Pretraining at Scale

Abstract

We propose to build omni-modal intelligence that is capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the number of modalities and the amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which are evaluated on the following tasks: i) single-modality perception benchmarks covering 10 different modalities, ii) 25 cross-modality understanding tasks spanning retrieval, question answering, and captioning, and iii) 18 multimodal large language model benchmarks. Our models set 37 new state-of-the-art records. We hope that our research can contribute to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo
