LLaSM: Large Language and Speech Model
August 30, 2023
Authors: Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
cs.AI
Abstract
Multi-modal large language models have garnered significant interest
recently. Most existing works, however, focus on vision-language multi-modal
models, which provide strong capabilities in following vision-and-language
instructions. We argue that speech is also an important modality through which
humans interact with the world, so it is crucial for a general-purpose
assistant to be able to follow multi-modal speech-and-language instructions. In
this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an
end-to-end trained large multi-modal speech-language model with cross-modal
conversational abilities, capable of following speech-and-language
instructions. Our early experiments show that LLaSM demonstrates a more
convenient and natural way for humans to interact with artificial intelligence.
In addition, we release LLaSM-Audio-Instructions, a large
speech-instruction-following dataset. Code and demo are available at
https://github.com/LinkSoul-AI/LLaSM and
https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions
dataset is available at
https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
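
For readers who want to inspect the released data, below is a minimal sketch of loading LLaSM-Audio-Instructions with the Hugging Face datasets library. Only the repository id comes from the abstract; the available splits and column layout are not specified there, so the snippet simply prints whatever the release provides.

```python
# Minimal sketch: fetch the LLaSM-Audio-Instructions dataset from the Hugging Face Hub.
# The repository id is taken from the abstract; split names and columns are not
# documented in the abstract, so we only inspect what load_dataset returns.
from datasets import load_dataset

dataset = load_dataset("LinkSoul/LLaSM-Audio-Instructions")

# Show the splits and features actually shipped with the release.
print(dataset)
```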