PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Abstract

Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching, none of which are captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture. It comprises 16 tasks and over 8,000 samples across three areas: speech understanding, paralinguistic analysis, and cultural audio understanding. Ten of the tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform their audio counterparts, suggesting that models may not leverage audio-specific information beyond what transcription alone provides. Culturally grounded tasks expose a qualitatively distinct failure mode: on vazn (poetic meter) detection, all models perform near random chance regardless of scale, suggesting that prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench
