
LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

August 21, 2025
作者: Yirong Sun, Yizhong Geng, Peidong Wei, Yanjun Chen, Jinghan Yang, Rongfei Chen, Wei Zhang, Xiaoyu Shen
cs.AI

Abstract

The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.