AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Abstract

We introduce AnyGPT, an any-to-any multimodal language model that uses discrete representations to process speech, text, images, and music in a unified way. AnyGPT can be trained stably without any changes to the current large language model (LLM) architecture or training paradigm. Instead, it relies exclusively on data-level preprocessing, so a new modality can be integrated into an LLM much as a new language would be. We build a multimodal, text-centric dataset for multimodal alignment pre-training. Using generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k multi-turn conversation samples that intricately interweave various modalities, equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT can carry out any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, showing that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are available at https://junzhan2000.github.io/AnyGPT.github.io/
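
To make the "new modality as a new language" idea concrete, the following is a minimal sketch of how discrete codes from several modalities could share one token id space with a text vocabulary. It is an illustration under stated assumptions, not the authors' implementation: the vocabulary size, the codebook sizes, the boundary-token scheme, and the names `build_vocab_layout`, `encode_segment`, and `build_sequence` are all hypothetical, since the abstract does not specify them.

```python
from typing import Dict, List, Tuple

# Assumed values for illustration only; the abstract gives no sizes.
TEXT_VOCAB_SIZE = 32_000
CODEBOOK_SIZES = {"image": 8192, "speech": 1024, "music": 4096}


def build_vocab_layout(
    text_vocab_size: int, codebook_sizes: Dict[str, int]
) -> Tuple[Dict[str, Dict[str, int]], int]:
    """Give each modality a contiguous id range placed after the text ids.

    Each modality gets two boundary tokens (start, end) followed by one id
    per codebook entry. Returns the layout and the total vocabulary size.
    """
    layout: Dict[str, Dict[str, int]] = {}
    next_id = text_vocab_size
    for name, size in codebook_sizes.items():
        layout[name] = {
            "start": next_id,       # hypothetical start-of-modality token
            "end": next_id + 1,     # hypothetical end-of-modality token
            "offset": next_id + 2,  # id of codebook entry 0
            "size": size,
        }
        next_id += size + 2
    return layout, next_id


def encode_segment(
    layout: Dict[str, Dict[str, int]], modality: str, codes: List[int]
) -> List[int]:
    """Map raw codebook indices to shared-vocabulary ids, wrapped in boundaries."""
    info = layout[modality]
    for code in codes:
        if not 0 <= code < info["size"]:
            raise ValueError(f"code {code} is out of range for {modality}")
    return [info["start"]] + [info["offset"] + c for c in codes] + [info["end"]]


def build_sequence(
    layout: Dict[str, Dict[str, int]], segments: List[Tuple[str, List[int]]]
) -> List[int]:
    """Flatten interleaved (modality, ids) segments into one token sequence.

    Text ids pass through unchanged; other modalities are shifted into
    their own id range so a single next-token model can consume them.
    """
    sequence: List[int] = []
    for modality, ids in segments:
        if modality == "text":
            sequence.extend(ids)
        else:
            sequence.extend(encode_segment(layout, modality, ids))
    return sequence


layout, total_vocab = build_vocab_layout(TEXT_VOCAB_SIZE, CODEBOOK_SIZES)
example = build_sequence(
    layout, [("text", [5, 6]), ("image", [0, 3]), ("text", [7])]
)
print(total_vocab)  # 45318
print(example)      # [5, 6, 32000, 32002, 32005, 32001, 7]
```

In this sketch the language model itself is untouched apart from a larger vocabulary; all modality handling happens while the token sequence is being built, which is one way to read "data-level preprocessing".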
