X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Abstract

Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond those of previous visual language models. We attribute this to the use of more advanced LLMs than those in previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and feeds them into a large language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X" denotes multi-modalities such as images, speech, and videos, and "L" denotes languages. X-LLM's training consists of three stages: (1) Converting multimodal information: the first stage trains each X2L interface separately to align with its respective single-modal encoder, converting multimodal information into languages. (2) Aligning X2L representations with the LLM: single-modal encoders are aligned with the LLM through X2L interfaces independently. (3) Integrating multiple modalities: all single-modal encoders are aligned with the LLM through X2L interfaces to integrate multimodal capabilities into the LLM. Our experiments show that X-LLM demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 84.5% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. We also conduct quantitative tests on using an LLM for automatic speech recognition (ASR) and multimodal ASR, hoping to promote the era of LLM-based speech recognition.
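The three-stage schedule described in the abstract can be pictured with a small bookkeeping sketch. This is only an illustration of which parts are frozen and which runs happen in each stage, as read from the abstract; it is not the authors' training code. The names `Component`, `build_components`, and `training_runs` are hypothetical, and the assumption that stage 1 does not involve the LLM is an inference from the abstract's wording rather than something it states.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    trainable: bool


def build_components(modalities):
    """Hypothetical helper: encoders and the LLM are frozen, X2L interfaces are trained."""
    encoders = {m: Component(f"{m}_encoder", trainable=False) for m in modalities}
    interfaces = {m: Component(f"{m}2L_interface", trainable=True) for m in modalities}
    llm = Component("llm", trainable=False)
    return encoders, interfaces, llm


def training_runs(stage, modalities):
    """Return one (modalities_trained_together, uses_llm) tuple per training run.

    Stage 1: each interface is trained separately against its own encoder
             (assumed here: no LLM involved).
    Stage 2: each interface is aligned with the LLM independently.
    Stage 3: all interfaces are aligned with the LLM together.
    """
    if stage == 1:
        return [((m,), False) for m in modalities]
    if stage == 2:
        return [((m,), True) for m in modalities]
    if stage == 3:
        return [(tuple(modalities), True)]
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    modalities = ["image", "speech", "video"]
    encoders, interfaces, llm = build_components(modalities)
    for stage in (1, 2, 3):
        print(f"stage {stage}: {training_runs(stage, modalities)}")
```

In this reading, only the X2L interfaces ever receive updates; the stages differ in what each interface is aligned against and in whether the modalities are handled one at a time or jointly.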
