The Role of Synthetic Data in Multilingual, Multi-cultural AI Systems: Lessons from Indic Languages

Abstract

Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages. It encompasses diverse reasoning and generative tasks, with an emphasis on long-context and multi-turn capabilities and on alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
