LLaSM: Large Language and Speech Model
August 30, 2023
Authors: Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
cs.AI
Abstract
Multi-modal large language models have garnered significant interest
recently. Most existing works, however, focus on vision-language multi-modal
models, which provide strong capabilities in following vision-and-language
instructions. We argue that speech is also an important modality through which
humans interact with the world, so it is crucial for a general-purpose
assistant to be able to follow multi-modal speech-and-language instructions. In
this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an
end-to-end trained large multi-modal speech-language model with cross-modal
conversational abilities, capable of following speech-and-language
instructions. Our early experiments show that LLaSM demonstrates a more
convenient and natural way for humans to interact with artificial intelligence.
In addition, we release LLaSM-Audio-Instructions, a large
speech-instruction-following dataset. Code and demo are available at
https://github.com/LinkSoul-AI/LLaSM and
https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions
dataset is available at
https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
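
For readers who want to inspect the released data, below is a minimal sketch of loading LLaSM-Audio-Instructions with the Hugging Face datasets library. Only the repository id comes from the abstract; the available splits and column layout are not specified there, so the snippet simply prints whatever the release provides.

```python
# Minimal sketch: fetch the LLaSM-Audio-Instructions dataset from the Hugging Face Hub.
# The repository id is taken from the abstract; split names and columns are not
# documented in the abstract, so we only inspect what load_dataset returns.
from datasets import load_dataset

dataset = load_dataset("LinkSoul/LLaSM-Audio-Instructions")

# Show the splits and features actually shipped with the release.
print(dataset)
```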