Soundwave: Less is More for Speech-Text Alignment in LLMs
February 18, 2025
Authors: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
cs.AI
Abstract
Existing end-to-end speech large language models (LLMs) usually rely on
large-scale annotated data for training, while data-efficient training has not
been discussed in depth. We focus on two fundamental problems between speech
and text: the representation space gap and sequence length inconsistency. We
propose Soundwave, which utilizes an efficient training strategy and a novel
architecture to address these issues. Results show that Soundwave outperforms
the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,
using only one-fiftieth of the training data. Further analysis shows that
Soundwave still retains its intelligence during conversation. The project is
available at https://github.com/FreedomIntelligence/Soundwave.
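The abstract names two mismatches between speech and text, a representation space gap and a sequence length inconsistency, but does not detail the architecture here. The following is a minimal, hypothetical sketch, not Soundwave's actual design: the strided-convolution adapter, the frame rate, and all dimensions below are assumptions chosen only to illustrate how a lightweight module can shrink a speech encoder's frame sequence and project it into an LLM's embedding space.

```python
import torch
import torch.nn as nn

# Hypothetical illustration (not Soundwave's actual architecture): speech
# encoders typically emit ~50 frames per second, so a 10-second utterance
# yields ~500 frames, while its transcript may be only a few dozen text
# tokens. A strided 1-D convolution both downsamples the frame sequence
# (narrowing the length mismatch) and projects speech features into the
# LLM embedding space (narrowing the representation gap).

class DownsampleAdapter(nn.Module):
    def __init__(self, speech_dim: int = 1280, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # kernel_size == stride: non-overlapping windows, sequence length
        # shrinks by a factor of `stride`.
        self.conv = nn.Conv1d(speech_dim, llm_dim, kernel_size=stride, stride=stride)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim)
        x = speech_feats.transpose(1, 2)   # (batch, speech_dim, frames)
        x = self.conv(x)                   # (batch, llm_dim, frames // stride)
        return x.transpose(1, 2)           # (batch, frames // stride, llm_dim)

if __name__ == "__main__":
    adapter = DownsampleAdapter()
    feats = torch.randn(1, 500, 1280)      # ~10 s of speech at 50 frames/s
    print(adapter(feats).shape)            # torch.Size([1, 125, 4096])
```

With stride 4, the 500 speech frames become 125 vectors, far closer to a transcript's token count; the actual downsampling scheme and ratio used by Soundwave are described in the paper and repository linked above.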