LLaSM: Large Language and Speech Model

Abstract

Multi-modal large language models have garnered significant interest recently. Most existing work, however, focuses on vision-language models, which provide strong capabilities in following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world, so a general-purpose assistant should be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM). LLaSM is an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. We also release LLaSM-Audio-Instructions, a large speech instruction-following dataset. Code and a demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
