

BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

October 29, 2025
作者: Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
cs.AI

Abstract
The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law and International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains, enabling assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
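The per-domain and per-language accuracy figures reported above (e.g. GPT-4o's 76.49% in Legal vs. 59.74% in Ayurveda) can be reproduced by grouping exact-match results over the QA pairs. The sketch below is a minimal illustration, assuming a hypothetical record schema with `domain`, `language`, `prediction`, and `answer` fields; the actual BhashaBench V1 data format may differ.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Aggregate exact-match accuracy per (domain, language) group.

    `records` is an iterable of dicts with assumed keys:
    'domain', 'language', 'prediction', 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["domain"], r["language"])
        total[key] += 1
        if r["prediction"] == r["answer"]:
            correct[key] += 1
    # Fraction of exactly matching answers within each group
    return {k: correct[k] / total[k] for k in total}

# Toy example with made-up records (not real benchmark data)
records = [
    {"domain": "Legal", "language": "en", "prediction": "B", "answer": "B"},
    {"domain": "Legal", "language": "en", "prediction": "A", "answer": "C"},
    {"domain": "Ayurveda", "language": "hi", "prediction": "D", "answer": "D"},
]
print(accuracy_by_group(records))
# {('Legal', 'en'): 0.5, ('Ayurveda', 'hi'): 1.0}
```

Grouping by (domain, language) jointly, rather than by each axis separately, is what exposes gaps such as models scoring lower on Hindi items within the same domain.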