EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Abstract

Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degraded knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results show that EchoX, trained on about six thousand hours of data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
