BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
October 29, 2025
Authors: Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
cs.AI
Abstract
The rapid advancement of large language models (LLMs) has intensified the need
for domain- and culture-specific evaluation. Existing benchmarks are largely
Anglocentric and domain-agnostic, limiting their applicability to India-centric
contexts. To address this gap, we introduce BhashaBench V1, the first
domain-specific, multi-task, bilingual benchmark focusing on critical Indic
knowledge systems. BhashaBench V1 contains 74,166 meticulously curated
question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from
authentic government and domain-specific exams. It spans four major domains:
Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and
covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs
reveals significant domain- and language-specific performance gaps, with
especially large disparities in low-resource domains. For instance, GPT-4o
achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models
consistently perform better on English content compared to Hindi across all
domains. Subdomain-level analysis shows that areas such as Cyber Law and
International Finance perform relatively well, while Panchakarma, Seed Science,
and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive
dataset for evaluating large language models across India's diverse knowledge
domains, enabling assessment of models' ability to integrate domain-specific
knowledge with bilingual understanding. All code, benchmarks, and resources are
publicly available to support open research.