LLaSM: Large Language and Speech Model
August 30, 2023
Authors: Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
cs.AI
Abstract
Multi-modal large language models have garnered significant interest recently, yet most existing work focuses on vision-language multi-modal models that provide strong capabilities in following vision-and-language instructions. We claim that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In
this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an
end-to-end trained large multi-modal speech-language model with cross-modal
conversational abilities, capable of following speech-and-language
instructions. Our early experiments show that LLaSM demonstrates a more
convenient and natural way for humans to interact with artificial intelligence.
In addition, we release a large speech-instruction-following dataset,
LLaSM-Audio-Instructions. Code and demo are available at
https://github.com/LinkSoul-AI/LLaSM and
https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions
dataset is available at
https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
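
As a usage note (not part of the abstract): since the LLaSM-Audio-Instructions dataset is hosted on the Hugging Face Hub, it can presumably be fetched with the `datasets` library. The snippet below is a minimal sketch; the exact splits and column names are assumptions and should be verified against the dataset card.

```python
# Minimal sketch: load the released dataset from the Hugging Face Hub.
# The repository id comes from the URL above; splits and column names are
# assumptions -- check the dataset card for the actual schema.
from datasets import load_dataset

# Downloads (and caches) the dataset; this may take a while given its size.
dataset = load_dataset("LinkSoul/LLaSM-Audio-Instructions")

# Inspect which splits and columns are actually available before use.
print(dataset)
```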