LLaSM: Large Language and Speech Model
August 30, 2023
Authors: Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
cs.AI
Abstract
Multi-modal large language models have garnered significant interest recently, yet most existing work focuses on vision-language multi-modal models that provide strong capabilities in following vision-and-language instructions. We claim that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In
this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an
end-to-end trained large multi-modal speech-language model with cross-modal
conversational abilities, capable of following speech-and-language
instructions. Our early experiments show that LLaSM demonstrates a more
convenient and natural way for humans to interact with artificial intelligence.
In addition, we release a large speech-instruction-following dataset,
LLaSM-Audio-Instructions. Code and demo are available at
https://github.com/LinkSoul-AI/LLaSM and
https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions
dataset is available at
https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
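
As a usage note (not part of the abstract): since the LLaSM-Audio-Instructions dataset is hosted on the Hugging Face Hub, it can presumably be fetched with the `datasets` library. The snippet below is a minimal sketch; the exact splits and column names are assumptions and should be verified against the dataset card.

```python
# Minimal sketch: load the released dataset from the Hugging Face Hub.
# The repository id comes from the URL above; splits and column names are
# assumptions -- check the dataset card for the actual schema.
from datasets import load_dataset

# Downloads (and caches) the dataset; this may take a while given its size.
dataset = load_dataset("LinkSoul/LLaSM-Audio-Instructions")

# Inspect which splits and columns are actually available before use.
print(dataset)
```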