The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
September 25, 2025
Authors: Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram
cs.AI
Abstract
Developing AI systems that operate effectively across languages while
remaining culturally grounded is a long-standing challenge, particularly in
low-resource settings. Synthetic data provides a promising avenue, yet its
effectiveness in multilingual and multicultural contexts remains underexplored.
We investigate the creation and impact of synthetic, culturally contextualized
datasets for Indian languages through a bottom-up generation strategy that
prompts large open-source LLMs (>= 235B parameters) to ground data generation
in language-specific Wikipedia content. This approach complements the dominant
top-down paradigm of translating synthetic datasets from high-resource
languages such as English. We introduce Updesh, a high-quality large-scale
synthetic instruction-following dataset comprising 9.5M data points across 13
Indian languages, encompassing diverse reasoning and generative tasks with an
emphasis on long-context, multi-turn capabilities, and alignment with Indian
cultural contexts. A comprehensive evaluation incorporating both automated
metrics and human annotation across 10k assessments indicates that the
generated data is of high quality, though human evaluation highlights areas for further
improvement. Additionally, we perform downstream evaluations by fine-tuning
models on our dataset and assessing the performance across 15 diverse
multilingual datasets. Models trained on Updesh consistently achieve
significant gains on generative tasks and remain competitive on multiple-choice
style NLU tasks. Notably, relative improvements are most pronounced in low-
and medium-resource languages, narrowing their gap with high-resource languages.
These findings provide empirical evidence that effective multilingual AI
requires multi-faceted data curation and generation strategies that incorporate
context-aware, culturally grounded methodologies.
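To make the bottom-up strategy concrete, the sketch below illustrates the core loop under stated assumptions: sample a passage from language-specific Wikipedia and prompt a large open-source LLM to produce an instruction-response pair grounded in that passage. The prompt wording, the generate_fn callable, and all identifiers here are illustrative placeholders, not the actual Updesh pipeline.

```python
import random

def build_grounded_prompt(passage: str, language: str) -> str:
    """Assemble a prompt that anchors generation in local-language source text."""
    return (
        f"Read the following {language} Wikipedia excerpt:\n\n{passage}\n\n"
        f"Write, entirely in {language}, one instruction a user might ask "
        f"about this content, followed by a faithful, culturally grounded "
        f"response. Format:\nInstruction: ...\nResponse: ..."
    )

def generate_grounded_pairs(passages, language, generate_fn, n_samples=3):
    """Draw passages and collect (passage, model_output) pairs.

    generate_fn is any callable that sends a prompt string to an LLM and
    returns its completion, e.g. a wrapper around a large open-source model.
    """
    sampled = random.sample(passages, min(n_samples, len(passages)))
    return [(p, generate_fn(build_grounded_prompt(p, language))) for p in sampled]
```

The design point, as the abstract notes, is that generation starts from content already written in the target language, rather than translating English synthetic data top-down.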