DRISHTIKON: A Multimodal, Multilingual Benchmark for Evaluating Language Models' Understanding of Indian Culture

Abstract

We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage of India's diverse regions: it spans 15 languages, covers all states and union territories, and incorporates over 64,000 aligned text-image pairs. The dataset captures rich cultural themes, including festivals, attire, cuisines, art forms, and historical heritage, among many others. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, in zero-shot and chain-of-thought settings. Our results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed for advancing culturally aware, multimodally competent language technologies.