How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Abstract

Large language models (LLMs) are widely used as the knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training, and how that knowledge affects downstream performance, remains unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark that tests the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions produced by an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across model families, and that text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
