
The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

September 25, 2025
作者: Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram
cs.AI

Abstract

Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (≥235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context and multi-turn capabilities and on alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing performance on 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
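The bottom-up strategy the abstract describes can be illustrated with a short sketch: sample a page from a language-specific Wikipedia edition, then prompt a large open-weights model to produce instruction-response pairs grounded in that passage. This is a minimal illustration under stated assumptions, not the authors' released pipeline; the `call_llm` stub, the prompt wording, and the `INDIC_WIKIS` language list are hypothetical, while the Wikipedia REST endpoint used is a real public API.

```python
# Minimal sketch of bottom-up, Wikipedia-grounded data generation
# (illustrative only; not the Updesh authors' code).
import requests

# A few of the 13 Indic-language Wikipedia editions; codes are illustrative.
INDIC_WIKIS = ["hi", "bn", "ta", "te", "mr"]

def fetch_random_article(lang: str) -> dict:
    """Fetch the title and plain-text extract of a random page via
    Wikipedia's public REST API."""
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/random/summary"
    resp = requests.get(url, headers={"User-Agent": "updesh-sketch/0.1"},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()

def build_prompt(lang: str, title: str, extract: str) -> str:
    """Bottom-up grounding: a target-language passage is the seed, so
    generated instructions are native to the language and its cultural
    context rather than translations of English synthetic data."""
    return (
        f"Passage from the '{lang}' Wikipedia article \"{title}\":\n\n"
        f"{extract}\n\n"
        "Write three diverse instruction-response pairs in the same "
        "language, grounded strictly in this passage. Include at least "
        "one long-form generative task and one multi-turn exchange."
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stub: route this to whichever >=235B open-weights
    # model serves the generation step.
    raise NotImplementedError

if __name__ == "__main__":
    for lang in INDIC_WIKIS:
        page = fetch_random_article(lang)
        prompt = build_prompt(lang, page["title"], page.get("extract", ""))
        print(prompt[:200], "...")
        # data = call_llm(prompt)  # one generation step of the pipeline
```

In a full pipeline this generation step would be followed by the quality controls the abstract mentions, i.e. automated filtering plus human annotation, before the data is used for fine-tuning.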