SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

Abstract

Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision that lacks precise timestamps, and benchmarks that fail to simulate real-world scenarios in which short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio-language model designed for grounding audio events. SpotSound incorporates a novel training objective specifically designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark in which target events occupy less than roughly 10% of each clip, creating a rigorous 'needle-in-a-haystack' evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks. Code, models, and the benchmark are available at https://loiesun.github.io/spotsound/
