EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
September 11, 2025
Authors: Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
cs.AI
Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing
attention. Derived from text-based large language models (LLMs), SLLMs often
exhibit degradation in knowledge and reasoning capabilities. We hypothesize
that this limitation arises because current training paradigms for SLLMs fail
to bridge the acoustic-semantic gap in the feature representation space. To
address this issue, we propose EchoX, which leverages semantic representations
and dynamically generates speech training targets. This approach integrates
both acoustic and semantic learning, enabling EchoX to preserve strong
reasoning abilities as a speech LLM. Experimental results demonstrate that
EchoX, trained on about six thousand hours of data, achieves leading
performance on multiple knowledge-based question-answering benchmarks. The
project is available at https://github.com/FreedomIntelligence/EchoX.
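The abstract describes dynamically generating speech training targets from semantic representations rather than training on fixed acoustic labels. The following toy sketch illustrates that general idea only; it is not the authors' implementation, and every name here (the stand-in semantic encoder, the speech-unit codebook, the nearest-neighbor target assignment) is a hypothetical simplification.

```python
import numpy as np

# Toy sketch of the idea behind dynamic speech-target generation
# (hypothetical; EchoX's actual method is described in the paper, not here).
rng = np.random.default_rng(0)

def semantic_encoder(token_ids, dim=8):
    # Stand-in for the frozen LLM's semantic hidden states, one per token.
    emb = rng.standard_normal((100, dim))
    return emb[token_ids]

def generate_speech_targets(sem_states, codebook):
    # Dynamically assign each semantic state to its nearest entry in a
    # discrete speech-unit codebook; the resulting indices act as the
    # speech training targets, so acoustic supervision stays tied to
    # the semantic representation space.
    dists = np.linalg.norm(sem_states[:, None, :] - codebook[None, :, :],
                           axis=-1)
    return dists.argmin(axis=1)

codebook = rng.standard_normal((16, 8))      # hypothetical 16-unit codebook
sem = semantic_encoder([3, 14, 15, 9, 2])    # 5 example tokens
targets = generate_speech_targets(sem, codebook)
print(targets.shape)                         # one speech-unit id per token
```

Under this reading, the targets move with the semantic encoder's representations instead of being fixed transcriptions, which is one plausible way to keep acoustic and semantic learning aligned.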