EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
September 11, 2025
Authors: Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
cs.AI
Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing
attention. Derived from text-based large language models (LLMs), SLLMs often
exhibit degradation in knowledge and reasoning capabilities. We hypothesize
that this limitation arises because current training paradigms for SLLMs fail
to bridge the acoustic-semantic gap in the feature representation space. To
address this issue, we propose EchoX, which leverages semantic representations
and dynamically generates speech training targets. This approach integrates
both acoustic and semantic learning, enabling EchoX to preserve strong
reasoning abilities as a speech LLM. Experimental results demonstrate that
EchoX, with about six thousand hours of training data, achieves advanced
performance on multiple knowledge-based question-answering benchmarks. The
project is available at https://github.com/FreedomIntelligence/EchoX.
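To make the abstract's central idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of joint acoustic-semantic training in which speech-unit targets are generated dynamically from semantic representations. All module names, shapes, the frozen unit codebook, and the nearest-codebook target rule are assumptions for illustration only; see the repository above for the actual EchoX method.

```python
# Hedged sketch: a toy speech LLM with a semantic (text) head and an acoustic
# (speech-unit) head. Speech-unit targets are produced on the fly from the
# model's own semantic hidden states, so both objectives are trained together.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_TEXT = 32000    # assumed text vocabulary size
VOCAB_UNITS = 1024    # assumed discrete speech-unit vocabulary size
D_MODEL = 512


class ToySpeechLLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared backbone standing in for the text-based LLM.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_head = nn.Linear(D_MODEL, VOCAB_TEXT)   # semantic branch
        self.unit_head = nn.Linear(D_MODEL, VOCAB_UNITS)  # acoustic branch

    def forward(self, speech_feats):
        h = self.backbone(speech_feats)
        return self.text_head(h), self.unit_head(h), h


@torch.no_grad()
def echo_targets(semantic_hidden, frozen_unit_codebook):
    """Hypothetical 'echo' step: map semantic hidden states to discrete
    speech-unit targets on the fly via nearest codebook entry."""
    # (B, T, D) x (V, D) -> (B, T, V) similarity, then argmax over units.
    sim = torch.einsum("btd,vd->btv", semantic_hidden, frozen_unit_codebook)
    return sim.argmax(dim=-1)


def training_step(model, speech_feats, text_targets, codebook):
    text_logits, unit_logits, hidden = model(speech_feats)

    # Semantic objective: preserve the LLM's knowledge/reasoning ability.
    loss_sem = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

    # Acoustic objective: supervise speech units with dynamically generated targets.
    unit_targets = echo_targets(hidden, codebook)
    loss_ac = F.cross_entropy(unit_logits.flatten(0, 1), unit_targets.flatten())

    return loss_sem + loss_ac


if __name__ == "__main__":
    model = ToySpeechLLM()
    feats = torch.randn(2, 50, D_MODEL)               # dummy speech features
    text_tgt = torch.randint(0, VOCAB_TEXT, (2, 50))  # dummy text labels
    codebook = torch.randn(VOCAB_UNITS, D_MODEL)      # dummy frozen unit codebook
    loss = training_step(model, feats, text_tgt, codebook)
    loss.backward()
    print(f"combined loss: {loss.item():.3f}")
```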