

Soundwave: Less is More for Speech-Text Alignment in LLMs

February 18, 2025
作者: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
cs.AI

Abstract

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.
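The two problems the abstract names, the representation-space gap and the sequence-length mismatch between speech features and text tokens, are commonly addressed by downsampling the audio frame sequence and projecting it into the LLM's embedding space. Below is a minimal NumPy sketch of that general idea; it is not Soundwave's actual architecture, and the dimensions, window averaging, and fixed projection matrix are illustrative assumptions.

```python
import numpy as np

def downsample(feats, stride=4):
    """Shrink the time axis by averaging non-overlapping windows of
    `stride` frames, easing the speech/text length mismatch.
    feats: (time, audio_dim) -> (time // stride, audio_dim)."""
    t = (feats.shape[0] // stride) * stride  # drop any trailing partial window
    return feats[:t].reshape(-1, stride, feats.shape[1]).mean(axis=1)

def project(feats, weight):
    """Map audio features into the LLM embedding space with a linear
    projection. weight: (audio_dim, llm_dim)."""
    return feats @ weight

# Toy example: 100 audio frames of dim 1280 -> 25 pseudo-tokens of dim 4096.
rng = np.random.default_rng(0)
speech = rng.standard_normal((100, 1280))   # encoder output (assumed shape)
weight = rng.standard_normal((1280, 4096))  # stand-in for a learned projection
out = project(downsample(speech), weight)
print(out.shape)  # (25, 4096)
```

In a trained system both the downsampling and the projection would be learned end-to-end; the fixed random matrix here only illustrates the shape bookkeeping.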

