Soundwave: Less is More for Speech-Text Alignment in LLMs
February 18, 2025
Authors: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
cs.AI
Abstract
Existing end-to-end speech large language models (LLMs) usually rely on
large-scale annotated data for training, while data-efficient training has not
been discussed in depth. We focus on two fundamental problems between speech
and text: the representation space gap and sequence length inconsistency. We
propose Soundwave, which utilizes an efficient training strategy and a novel
architecture to address these issues. Results show that Soundwave outperforms
the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,
using only one-fiftieth of the training data. Further analysis shows that
Soundwave still retains its intelligence during conversation. The project is
available at https://github.com/FreedomIntelligence/Soundwave.
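The abstract names two mismatches between speech and text, a representation space gap and a sequence length inconsistency, but does not detail the architecture here. The following is a minimal, hypothetical sketch, not Soundwave's actual design: the strided-convolution adapter, the frame rate, and all dimensions below are assumptions chosen only to illustrate how a lightweight module can shrink a speech encoder's frame sequence and project it into an LLM's embedding space.

```python
import torch
import torch.nn as nn

# Hypothetical illustration (not Soundwave's actual architecture): speech
# encoders typically emit ~50 frames per second, so a 10-second utterance
# yields ~500 frames, while its transcript may be only a few dozen text
# tokens. A strided 1-D convolution both downsamples the frame sequence
# (narrowing the length mismatch) and projects speech features into the
# LLM embedding space (narrowing the representation gap).

class DownsampleAdapter(nn.Module):
    def __init__(self, speech_dim: int = 1280, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # kernel_size == stride: non-overlapping windows, sequence length
        # shrinks by a factor of `stride`.
        self.conv = nn.Conv1d(speech_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim)
        x = speech_feats.transpose(1, 2)   # (batch, speech_dim, frames)
        x = self.conv(x)                   # (batch, llm_dim, frames // stride)
        return x.transpose(1, 2)           # (batch, frames // stride, llm_dim)

if __name__ == "__main__":
    adapter = DownsampleAdapter()
    feats = torch.randn(1, 500, 1280)      # ~10 s of speech at 50 frames/s
    print(adapter(feats).shape)            # torch.Size([1, 125, 4096])
```

With stride 4, the 500 speech frames become 125 vectors, far closer to a transcript's token count; the actual downsampling scheme and ratio used by Soundwave are described in the paper and repository linked above.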