LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
August 21, 2025
Authors: Yirong Sun, Yizhong Geng, Peidong Wei, Yanjun Chen, Jinghan Yang, Rongfei Chen, Wei Zhang, Xiaoyu Shen
cs.AI
Abstract
The development of Large Speech-Language Models (LSLMs) has been slowed by
fragmented architectures and a lack of transparency, hindering the systematic
comparison and reproducibility of research. Unlike in the vision-language
domain, the LSLM field suffers from the common practice of releasing model
weights without their corresponding training data and configurations. To
address these critical gaps, we introduce LLaSO, the first fully open,
end-to-end framework for large-scale speech-language modeling. LLaSO provides
the community with three essential resources: (1) LLaSO-Align, a 12M-instance
speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task
instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for
standardized evaluation. To validate our framework, we build and release
LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public
data. It achieves a normalized score of 0.72, establishing a strong,
reproducible baseline that surpasses comparable models. Our analysis reveals
that while broader training coverage enhances performance, significant
generalization gaps persist on unseen tasks, particularly in pure audio
scenarios. By releasing the complete stack of data, benchmarks, and models,
LLaSO establishes a foundational open standard to unify research efforts and
accelerate community-driven progress in LSLMs. We release the code, datasets,
pretrained models, and results at https://github.com/EIT-NLP/LLaSO.