PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Abstract

Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching, none of which are captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture. It comprises 16 tasks and over 8,000 samples across three areas: speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform their audio counterparts, suggesting that models may not leverage audio-specific information beyond what transcription alone provides. Culturally grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn (poetic meter) detection regardless of scale, suggesting that prosodic perception remains beyond the reach of current models. The dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/PARSA-Bench