Decoding the Diversity: A Review of the Indic AI Research Landscape
June 13, 2024
Authors: Sankalp KJ, Vinija Jain, Sreyoshi Bhaduri, Tamoghna Roy, Aman Chadha
cs.AI
Abstract
This review paper provides a comprehensive overview of large language model
(LLM) research directions within Indic languages. Indic languages are those
spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri
Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural
and linguistic heritage and are spoken by over 1.5 billion people worldwide.
With the tremendous market potential and growing demand for natural language
processing (NLP) based applications in diverse languages, generative
applications for Indic languages pose unique challenges and opportunities for
research. Our paper examines recent advancements in Indic generative modeling
in depth, contributing a taxonomy of research directions and tabulating 84
recent publications. Research directions surveyed in this paper include LLM
development, fine-tuning existing LLMs, development of corpora, benchmarking
and evaluation, as well as publications around specific techniques, tools, and
applications. We found that researchers across the publications emphasize the
challenges associated with limited data availability, lack of standardization,
and the distinctive linguistic complexities of Indic languages. This work aims to
serve as a valuable resource for researchers and practitioners working in the
field of NLP, particularly those focused on Indic languages, and contributes to
the development of more accurate and efficient LLM applications for these
languages.