BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
October 29, 2025
Authors: Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
cs.AI
Abstract
The rapid advancement of large language models (LLMs) has intensified the need
for domain- and culture-specific evaluation. Existing benchmarks are largely
Anglocentric and domain-agnostic, limiting their applicability to India-centric
contexts. To address this gap, we introduce BhashaBench V1, the first
domain-specific, multi-task, bilingual benchmark focusing on critical Indic
knowledge systems. BhashaBench V1 contains 74,166 meticulously curated
question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from
authentic government and domain-specific exams. It spans four major domains:
Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and
covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs
reveals significant domain- and language-specific performance gaps, with
especially large disparities in low-resource domains. For instance, GPT-4o
achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models
consistently perform better on English content compared to Hindi across all
domains. Subdomain-level analysis shows that areas such as Cyber Law and
International Finance perform relatively well, while Panchakarma, Seed Science,
and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive
dataset for evaluating large language models across India's diverse knowledge
domains, enabling assessment of models' ability to integrate domain-specific
knowledge with bilingual understanding. All code, benchmarks, and resources are
publicly available to support open research.