Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
October 20, 2024
Authors: Alan Dao, Dinh Bach Vu, Huy Hoang Ha
cs.AI
Abstract
Large Language Models (LLMs) have revolutionized natural language processing,
but their application to speech-based tasks remains challenging due to the
complexities of integrating audio and text modalities. This paper introduces
Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of
speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes
speech into discrete tokens and employs a uniform transformer-based
architecture for both speech and text modalities. This method enables joint
reasoning and generation across modalities without the need for separate
adapters. We present a comprehensive training methodology, including
pre-training on multilingual speech recognition datasets and fine-tuning on a
curated instruction dataset. Ichigo demonstrates state-of-the-art performance
on speech question-answering benchmarks, outperforming existing open-source
speech language models and achieving results comparable to those of cascaded systems.
Notably, Ichigo achieves a first-token latency of just 111 ms, significantly
lower than that of current models. Our approach not only advances the
field of multimodal AI but also provides a framework for smaller research teams
to contribute effectively to open-source speech-language models.
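To make the tokenized early-fusion idea concrete, below is a minimal PyTorch sketch. It is not the paper's actual implementation: the quantizer, codebook size, vocabulary size, and the names quantize_speech and to_llm_ids are all hypothetical stand-ins. The point it illustrates is that speech features are vector-quantized into discrete ids, offset past the text vocabulary, and concatenated with text token ids, so a single transformer consumes one interleaved sequence.

```python
# Minimal sketch of tokenized early fusion (hypothetical names and sizes;
# the paper's actual quantizer, codebook, and model are not specified here).
import torch

SOUND_VOCAB_SIZE = 512   # assumed codebook size for the speech quantizer
TEXT_VOCAB_SIZE = 32000  # assumed base text vocabulary of the LLM


def quantize_speech(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map acoustic frames to their nearest codebook entry (vector quantization).

    features: (frames, dim) acoustic embeddings from a speech encoder.
    codebook: (SOUND_VOCAB_SIZE, dim) learned code vectors.
    Returns discrete speech token ids in [0, SOUND_VOCAB_SIZE).
    """
    dists = torch.cdist(features, codebook)  # (frames, SOUND_VOCAB_SIZE)
    return dists.argmin(dim=-1)              # closest code per frame


def to_llm_ids(speech_ids: torch.Tensor) -> torch.Tensor:
    """Offset speech token ids past the text vocabulary so one transformer
    with a single (extended) embedding table can consume both modalities."""
    return speech_ids + TEXT_VOCAB_SIZE


# Example: interleave a speech segment with text tokens into one sequence.
features = torch.randn(50, 64)                    # stand-in encoder output
codebook = torch.randn(SOUND_VOCAB_SIZE, 64)      # stand-in learned codebook
speech_tokens = to_llm_ids(quantize_speech(features, codebook))
text_tokens = torch.tensor([101, 2023, 2003])     # stand-in text token ids
mixed_sequence = torch.cat([speech_tokens, text_tokens])  # fed to the LLM as-is
```

Because the speech tokens share one embedding table and one sequence with the text tokens, the model can attend across modalities and emit either kind of token, which is what allows joint reasoning and generation without per-modality adapters.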