Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

Abstract

Although instruction-tuned large language models (LLMs) have shown remarkable capabilities across various NLP tasks, their effectiveness on data modalities beyond text has not been fully studied. In this work, we propose Macaw-LLM, a novel multi-modal LLM that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations. Our novel alignment module bridges multi-modal features to textual features, simplifying the adaptation process from the modality modules to the cognitive module. In addition, we construct a large-scale multi-modal instruction dataset in the form of multi-turn dialogues, including 69K image instances and 50K video instances. We have made our data, code, and model publicly available, which we hope will pave the way for future research on multi-modal LLMs and expand the capabilities of LLMs to handle diverse data modalities and address complex real-world scenarios.
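To make the three-component layout more concrete, the following is a minimal, dependency-free sketch of how encoded modality features might be mapped into a text embedding space and placed in front of the text tokens. It is an illustration under stated assumptions, not the released implementation: the random matrices stand in for pretrained encoders and a pretrained LLM embedding table, the attention-over-text-embeddings step is only one plausible way to bridge multi-modal features to textual features, and every name (`align`, `build_llm_input`, `projections`, and so on) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for pretrained weights or encoder outputs (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def align(features, projection, text_embeddings):
    """Illustrative alignment module: project modality features to the text
    width, then re-express each one as a softmax-weighted mix of text embeddings."""
    queries = matmul(features, projection)                   # (n, d_text)
    keys_t = [list(col) for col in zip(*text_embeddings)]    # (d_text, vocab)
    scale = math.sqrt(len(text_embeddings[0]))
    scores = matmul(queries, keys_t)                         # (n, vocab)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, text_embeddings)                  # (n, d_text)


def build_llm_input(modality_features, projections, text_embeddings, token_ids):
    """Concatenate aligned modality features with the text token embeddings."""
    aligned = []
    for name, feats in modality_features.items():
        aligned.extend(align(feats, projections[name], text_embeddings))
    text = [text_embeddings[i] for i in token_ids]
    return aligned + text


rng = random.Random(0)
vocab_size, d_text = 10, 4
text_embeddings = random_matrix(vocab_size, d_text, rng)   # cognitive module's table

# Modality module outputs (hypothetical shapes: 3 image and 2 audio feature vectors).
modality_features = {
    "image": random_matrix(3, 6, rng),
    "audio": random_matrix(2, 5, rng),
}
projections = {
    "image": random_matrix(6, d_text, rng),
    "audio": random_matrix(5, d_text, rng),
}
token_ids = [1, 7, 2]

sequence = build_llm_input(modality_features, projections, text_embeddings, token_ids)
print(len(sequence), len(sequence[0]))  # prints "8 4"
```

In this sketch every aligned vector is a convex combination of rows of the text embedding table, so the language model would receive inputs that lie in the same space as its ordinary token embeddings. That is the design choice the sketch is meant to highlight; the projection sizes and the number of feature vectors are arbitrary.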
