PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Abstract

Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching, none of which are captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture. It comprises 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten of the tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform their audio counterparts, suggesting that models may not leverage audio-specific information beyond what transcription alone provides. Culturally grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn (poetic meter) detection regardless of scale, suggesting that prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench

