The Role of Synthetic Data in Multilingual, Multi-cultural AI Systems: Lessons from Indic Languages

Abstract

Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages. It covers diverse reasoning and generative tasks, with an emphasis on long-context and multi-turn capabilities and on alignment with Indian cultural contexts. A comprehensive evaluation combining automated metrics with human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing their performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice-style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.

