IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages
November 29, 2025
Authors: Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari
cs.AI
Abstract
While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, and Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, and Konkani as extremely low-resource), plus a Sanskrit-English code-mixed set. We evaluate 19 LLMs, both proprietary and open-weight, and find that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to distinguish factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into the limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at https://huggingface.co/datasets/bharatgenai/IndicParam. Scripts to run the benchmark are available at https://github.com/ayushbits/IndicParam.
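As a rough illustration of how average accuracy on a multiple-choice benchmark like this is computed, the sketch below scores a prediction function against gold answers. The record fields (`question`, `options`, `answer`), the sample data, and the baseline predictor are hypothetical stand-ins, not the actual IndicParam schema or evaluation harness.

```python
def mcq_accuracy(records, predict):
    """Fraction of multiple-choice questions answered correctly.

    records: list of dicts with hypothetical keys "question", "options", "answer"
    predict: function (question, options) -> chosen option
    """
    if not records:
        return 0.0
    correct = sum(
        1 for rec in records
        if predict(rec["question"], rec["options"]) == rec["answer"]
    )
    return correct / len(records)

# Hypothetical records mimicking MCQ benchmark entries (not the real schema).
sample = [
    {"question": "Q1", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "Q2", "options": ["A", "B", "C", "D"], "answer": "D"},
]

# A trivial baseline that always picks the first option.
baseline = lambda question, options: options[0]
print(mcq_accuracy(sample, baseline))  # 0.0: neither gold answer is option "A"
```

In practice, per-language accuracies over the 11 language subsets would be averaged to produce the headline numbers reported above (e.g. 45.0% for GPT-5).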