LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Abstract

The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, which hinder systematic comparison and reproducibility of research. Unlike the vision-language domain, the LSLM field commonly releases model weights without the corresponding training data and configurations. To address these gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate the framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis shows that broader training coverage improves performance, but significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, datasets, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.
