LLaSM: Large Language and Speech Model

August 30, 2023
Authors: Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
cs.AI

Abstract

Multi-modal large language models have garnered significant interest recently. However, most work focuses on vision-language multi-modal models that provide strong capabilities in following vision-and-language instructions. We claim that speech is also an important modality through which humans interact with the world, so it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM). LLaSM is an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. We also release a large speech-instruction-following dataset, LLaSM-Audio-Instructions. Code and demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
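As a minimal sketch (not part of the paper), the released LLaSM-Audio-Instructions dataset can be pulled from the Hugging Face Hub with the `datasets` library; the repository id is taken from the URL above, while the split and column names are assumptions and should be checked against the actual dataset card.

```python
# Hedged example: load the LLaSM-Audio-Instructions dataset from the Hub.
# The repo id "LinkSoul/LLaSM-Audio-Instructions" comes from the dataset URL above;
# available splits and column names are not specified here and may differ.
from datasets import load_dataset

ds = load_dataset("LinkSoul/LLaSM-Audio-Instructions")  # downloads all available splits
print(ds)  # inspect split names, sizes, and columns before relying on them
```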