EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
September 11, 2025
Authors: Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
cs.AI
Abstract
Speech-to-speech large language models (SLLMs) are attracting increasing
attention. Derived from text-based large language models (LLMs), SLLMs often
exhibit degradation in knowledge and reasoning capabilities. We hypothesize
that this limitation arises because current training paradigms for SLLMs fail
to bridge the acoustic-semantic gap in the feature representation space. To
address this issue, we propose EchoX, which leverages semantic representations
and dynamically generates speech training targets. This approach integrates
both acoustic and semantic learning, enabling EchoX to preserve strong
reasoning abilities as a speech LLM. Experimental results demonstrate that
EchoX, with about six thousand hours of training data, achieves advanced
performance on multiple knowledge-based question-answering benchmarks. The
project is available at https://github.com/FreedomIntelligence/EchoX.
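To make the abstract's central idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of joint acoustic-semantic training in which speech-unit targets are generated dynamically from semantic representations. All module names, shapes, the frozen unit codebook, and the nearest-codebook target rule are assumptions for illustration only; see the repository above for the actual EchoX method.

```python
# Hedged sketch: a toy speech LLM with a semantic (text) head and an acoustic
# (speech-unit) head. Speech-unit targets are produced on the fly from the
# model's own semantic hidden states, so both objectives are trained together.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_TEXT = 32000    # assumed text vocabulary size
VOCAB_UNITS = 1024    # assumed discrete speech-unit vocabulary size
D_MODEL = 512


class ToySpeechLLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared backbone standing in for the text-based LLM.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_head = nn.Linear(D_MODEL, VOCAB_TEXT)   # semantic branch
        self.unit_head = nn.Linear(D_MODEL, VOCAB_UNITS)  # acoustic branch

    def forward(self, speech_feats):
        h = self.backbone(speech_feats)
        return self.text_head(h), self.unit_head(h), h


@torch.no_grad()
def echo_targets(semantic_hidden, frozen_unit_codebook):
    """Hypothetical 'echo' step: map semantic hidden states to discrete
    speech-unit targets on the fly via nearest codebook entry."""
    # (B, T, D) x (V, D) -> (B, T, V) similarity, then argmax over units.
    sim = torch.einsum("btd,vd->btv", semantic_hidden, frozen_unit_codebook)
    return sim.argmax(dim=-1)


def training_step(model, speech_feats, text_targets, codebook):
    text_logits, unit_logits, hidden = model(speech_feats)

    # Semantic objective: preserve the LLM's knowledge/reasoning ability.
    loss_sem = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

    # Acoustic objective: supervise speech units with dynamically generated targets.
    unit_targets = echo_targets(hidden, codebook)
    loss_ac = F.cross_entropy(unit_logits.flatten(0, 1), unit_targets.flatten())

    return loss_sem + loss_ac


if __name__ == "__main__":
    model = ToySpeechLLM()
    feats = torch.randn(2, 50, D_MODEL)               # dummy speech features
    text_tgt = torch.randint(0, VOCAB_TEXT, (2, 50))  # dummy text labels
    codebook = torch.randn(VOCAB_UNITS, D_MODEL)      # dummy frozen unit codebook
    loss = training_step(model, feats, text_tgt, codebook)
    loss.backward()
    print(f"combined loss: {loss.item():.3f}")
```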