The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
September 25, 2025
Authors: Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram
cs.AI
Abstract
Developing AI systems that operate effectively across languages while
remaining culturally grounded is a long-standing challenge, particularly in
low-resource settings. Synthetic data provides a promising avenue, yet its
effectiveness in multilingual and multicultural contexts remains underexplored.
We investigate the creation and impact of synthetic, culturally contextualized
datasets for Indian languages through a bottom-up generation strategy that
prompts large open-source LLMs (>= 235B parameters) to ground data generation
in language-specific Wikipedia content. This approach complements the dominant
top-down paradigm of translating synthetic datasets from high-resource
languages such as English. We introduce Updesh, a high-quality large-scale
synthetic instruction-following dataset comprising 9.5M data points across 13
Indian languages, encompassing diverse reasoning and generative tasks with an
emphasis on long-context, multi-turn capabilities, and alignment with Indian
cultural contexts. A comprehensive evaluation incorporating both automated
metrics and human annotation across 10k assessments indicates that the generated
data is of high quality, though human evaluation highlights areas for further
improvement. Additionally, we perform downstream evaluations by fine-tuning
models on our dataset and assessing performance across 15 diverse
multilingual datasets. Models trained on Updesh consistently achieve
significant gains on generative tasks and remain competitive on
multiple-choice-style NLU tasks. Notably, relative improvements are most
pronounced in low- and
medium-resource languages, narrowing their gap with high-resource languages.
These findings provide empirical evidence that effective multilingual AI
requires multi-faceted data curation and generation strategies that incorporate
context-aware, culturally grounded methodologies.
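
For illustration, the following is a minimal Python sketch of the bottom-up, Wikipedia-grounded generation strategy described in the abstract. It assumes a generic `call_llm` callable standing in for a large (>= 235B-parameter) open-source LLM endpoint; the prompt template and helper names are hypothetical and do not reproduce the authors' actual Updesh pipeline.

```python
# Hypothetical sketch: generate one culturally grounded instruction-response
# pair from a language-specific Wikipedia passage. `call_llm` is a placeholder
# for any large open-source LLM endpoint; this is NOT the authors' pipeline.
from typing import Callable

import requests


def fetch_wikipedia_extract(title: str, lang_code: str) -> str:
    """Fetch a plain-text extract from the given language's Wikipedia
    via the standard MediaWiki API (e.g. lang_code="hi" for Hindi)."""
    resp = requests.get(
        f"https://{lang_code}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "format": "json",
            "titles": title,
        },
        timeout=30,
    )
    pages = resp.json()["query"]["pages"]
    # The API keys pages by page ID; take the first (and only) result.
    return next(iter(pages.values())).get("extract", "")


# Illustrative prompt: keep the model grounded in the native-language
# passage and generating in that language, rather than translating
# English synthetic data top-down.
PROMPT_TEMPLATE = (
    "Below is a passage from the {lang} Wikipedia.\n\n"
    "{passage}\n\n"
    "Write one instruction and its response, both in {lang}, that can be "
    "answered entirely from the passage and reflect its cultural context."
)


def generate_grounded_example(
    title: str, lang_code: str, lang_name: str, call_llm: Callable[[str], str]
) -> str:
    """Produce one grounded instruction-following example, e.g.
    generate_grounded_example("ताज महल", "hi", "Hindi", call_llm)."""
    passage = fetch_wikipedia_extract(title, lang_code)
    # Truncate the passage to keep the prompt within a modest context budget.
    prompt = PROMPT_TEMPLATE.format(lang=lang_name, passage=passage[:4000])
    return call_llm(prompt)
```

Grounding each prompt in a native-language source passage is what distinguishes this bottom-up approach from the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English.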