LLaSM: Large Language and Speech Model

Abstract

Multi-modal large language models have garnered significant interest recently. However, most work focuses on vision-language multi-modal models, which provide strong capabilities in following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM). LLaSM is an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. We also release a large speech instruction-following dataset, LLaSM-Audio-Instructions. Code and demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
