Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

Abstract

Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding their predictions in non-text inputs. Large audio-language models (LALMs) are one example: decisive audio evidence can be under-utilized even when it carries important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio-silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the model's audio effect. On MMAU, this intervention improves the accuracy of two Qwen-based LALMs by up to +8.0 percentage points without any parameter updates.
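The two ingredients named in the abstract, a "listening" signal read from selected attention heads and an audio-silence steering direction added to the final representation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the array shapes, the choice to average attention mass over heads, and the fixed scale `alpha` are all assumptions made for illustration; the abstract does not specify how the signal is aggregated or how the intervention strength is set.

```python
import numpy as np

def listening_signal(attn, audio_mask, heads):
    """Mean attention mass that the chosen heads place on audio tokens.

    attn:       (layers, heads, keys) attention weights from one query
                position; each (layer, head) row sums to 1.
    audio_mask: (keys,) boolean array, True at audio token positions.
    heads:      list of (layer, head) pairs; hypothetical specialist heads.
    """
    masses = [attn[layer, head, audio_mask].sum() for layer, head in heads]
    return float(np.mean(masses))

def steering_direction(h_audio, h_silence):
    """Difference between final representations with audio and with silence."""
    return h_audio - h_silence

def steer(h, direction, alpha):
    """Shift a final representation along the direction (alpha is assumed)."""
    return h + alpha * direction

# Toy example: one layer, two heads, three key positions (two are audio).
attn = np.array([[[0.5, 0.25, 0.25],
                  [0.1, 0.1, 0.8]]])
audio_mask = np.array([True, True, False])
signal = listening_signal(attn, audio_mask, heads=[(0, 0), (0, 1)])

direction = steering_direction(np.array([1.0, 2.0]), np.array([0.0, 0.5]))
steered = steer(np.zeros(2), direction, alpha=2.0)
```

In a real model the attention weights and final representations would be read from, and written back to, the network's activations at inference time; how the listening signal might modulate the steering strength is left open here because the abstract does not describe it.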
