Do Audio-Visual Large Language Models Really See and Hear?

Abstract

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.
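The layer-wise probing idea mentioned above can be illustrated with a minimal sketch. This is not the authors' method or code. It assumes a nearest-centroid probe (a simple linear classifier) as a stand-in for whatever probe the study uses, and it runs on synthetic vectors rather than real AVLLM hidden states. The names `centroid_probe_accuracy`, `synthetic_layer`, and the per-layer `signal` values are hypothetical and chosen only to show the mechanics: a probe is fit separately on each layer's representations, and its held-out accuracy indicates how linearly decodable the audio label is at that depth.

```python
import random


def centroid_probe_accuracy(train, test):
    """Fit a nearest-centroid probe on (vector, label) pairs; return test accuracy."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(vec):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])),
        )

    correct = sum(predict(vec) == label for vec, label in test)
    return correct / len(test)


def synthetic_layer(rng, n, dim, signal):
    """Hypothetical stand-in for one layer's hidden states.

    The binary "audio label" shifts the first coordinate by +/- signal;
    every other coordinate is unit Gaussian noise.
    """
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label else -signal
        data.append((vec, label))
    return data


rng = random.Random(0)
# Illustrative signal strengths only; these are NOT measurements from the paper.
signal_by_layer = {"early": 0.0, "middle": 3.0, "late": 0.5}

probe_accuracy = {}
for name, signal in signal_by_layer.items():
    train = synthetic_layer(rng, 400, 8, signal)
    test = synthetic_layer(rng, 200, 8, signal)
    probe_accuracy[name] = centroid_probe_accuracy(train, test)

for name, acc in probe_accuracy.items():
    print(f"{name}: probe accuracy = {acc:.2f}")
```

In a real analysis, the synthetic vectors would be replaced by hidden states extracted at audio token positions from each layer, and the probe accuracy per layer would then be compared against what the model actually outputs as text.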
