EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
September 11, 2025
Authors: Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
cs.AI
Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing
attention. Derived from text-based large language models (LLMs), SLLMs often
exhibit degradation in knowledge and reasoning capabilities. We hypothesize
that this limitation arises because current training paradigms for SLLMs fail
to bridge the acoustic-semantic gap in the feature representation space. To
address this issue, we propose EchoX, which leverages semantic representations
and dynamically generates speech training targets. This approach integrates
both acoustic and semantic learning, enabling EchoX to preserve strong
reasoning abilities as a speech LLM. Experimental results demonstrate that
EchoX, trained on about six thousand hours of data, achieves leading
performance on multiple knowledge-based question-answering benchmarks. The
project is available at https://github.com/FreedomIntelligence/EchoX.
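The abstract describes dynamically generating speech training targets from semantic representations rather than training on fixed acoustic labels. The following toy sketch illustrates that general idea only; it is not the authors' implementation, and every name here (the stand-in semantic encoder, the speech-unit codebook, the nearest-neighbor target assignment) is a hypothetical simplification.

```python
import numpy as np

# Toy sketch of the idea behind dynamic speech-target generation
# (hypothetical; EchoX's actual method is described in the paper, not here).
rng = np.random.default_rng(0)

def semantic_encoder(token_ids, dim=8):
    # Stand-in for the frozen LLM's semantic hidden states, one per token.
    emb = rng.standard_normal((100, dim))
    return emb[token_ids]

def generate_speech_targets(sem_states, codebook):
    # Dynamically assign each semantic state to its nearest entry in a
    # discrete speech-unit codebook; the resulting indices act as the
    # speech training targets, so acoustic supervision stays tied to
    # the semantic representation space.
    dists = np.linalg.norm(sem_states[:, None, :] - codebook[None, :, :],
                           axis=-1)
    return dists.argmin(axis=1)

codebook = rng.standard_normal((16, 8))      # hypothetical 16-unit codebook
sem = semantic_encoder([3, 14, 15, 9, 2])    # 5 example tokens
targets = generate_speech_targets(sem, codebook)
print(targets.shape)                         # one speech-unit id per token
```

Under this reading, the targets move with the semantic encoder's representations instead of being fixed transcriptions, which is one plausible way to keep acoustic and semantic learning aligned.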