AHELM: A Holistic Evaluation of Audio-Language Models

Abstract

Evaluations of audio-language models (ALMs), multimodal models that take interleaved audio and text as input and output text, are hindered by the lack of standardized benchmarks. Most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult because separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets to holistically measure the performance of ALMs across 10 aspects we have identified as important to their development and usage: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. The aggregated datasets include 2 new synthetic audio-text datasets: PARADE, which evaluates ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems, each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness (p=0.01) on ASR tasks, whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 5th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark, and new datasets and models will be added over time.

