The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
September 25, 2025
Authors: Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram
cs.AI
Abstract
Developing AI systems that operate effectively across languages while
remaining culturally grounded is a long-standing challenge, particularly in
low-resource settings. Synthetic data provides a promising avenue, yet its
effectiveness in multilingual and multicultural contexts remains underexplored.
We investigate the creation and impact of synthetic, culturally contextualized
datasets for Indian languages through a bottom-up generation strategy that
prompts large open-source LLMs (>= 235B parameters) to ground data generation
in language-specific Wikipedia content. This approach complements the dominant
top-down paradigm of translating synthetic datasets from high-resource
languages such as English. We introduce Updesh, a high-quality large-scale
synthetic instruction-following dataset comprising 9.5M data points across 13
Indian languages, encompassing diverse reasoning and generative tasks with an
emphasis on long-context, multi-turn capabilities, and alignment with Indian
cultural contexts. A comprehensive evaluation incorporating both automated
metrics and human annotation across 10k assessments indicates that the
generated data is of high quality, though human evaluation highlights areas for further
improvement. Additionally, we perform downstream evaluations by fine-tuning
models on our dataset and assessing the performance across 15 diverse
multilingual datasets. Models trained on Updesh consistently achieve
significant gains on generative tasks and remain competitive on multiple-choice
style NLU tasks. Notably, relative improvements are most pronounced in low-
and medium-resource languages, narrowing their gap with high-resource languages.
These findings provide empirical evidence that effective multilingual AI
requires multi-faceted data curation and generation strategies that incorporate
context-aware, culturally grounded methodologies.
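To make the bottom-up strategy concrete, the sketch below illustrates the core loop under stated assumptions: sample a passage from language-specific Wikipedia and prompt a large open-source LLM to produce an instruction-response pair grounded in that passage. The prompt wording, the generate_fn callable, and all identifiers here are illustrative placeholders, not the actual Updesh pipeline.

```python
import random

def build_grounded_prompt(passage: str, language: str) -> str:
    """Assemble a prompt that anchors generation in local-language source text."""
    return (
        f"Read the following {language} Wikipedia excerpt:\n\n{passage}\n\n"
        f"Write, entirely in {language}, one instruction a user might ask "
        f"about this content, followed by a faithful, culturally grounded "
        f"response. Format:\nInstruction: ...\nResponse: ..."
    )

def generate_grounded_pairs(passages, language, generate_fn, n_samples=3):
    """Draw passages and collect (passage, model_output) pairs.

    generate_fn is any callable that sends a prompt string to an LLM and
    returns its completion, e.g. a wrapper around a large open-source model.
    """
    sampled = random.sample(passages, min(n_samples, len(passages)))
    return [(p, generate_fn(build_grounded_prompt(p, language))) for p in sampled]
```

The design point, as the abstract notes, is that generation starts from content already written in the target language, rather than translating English synthetic data top-down.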