Decoding the Diversity: A Review of the Indic AI Research Landscape
June 13, 2024
Authors: Sankalp KJ, Vinija Jain, Sreyoshi Bhaduri, Tamoghna Roy, Aman Chadha
cs.AI
Abstract
This review paper provides a comprehensive overview of large language model
(LLM) research directions within Indic languages. Indic languages are those
spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri
Lanka, Nepal, and Bhutan, among others. These languages have a rich cultural
and linguistic heritage and are spoken by over 1.5 billion people worldwide.
With the tremendous market potential and growing demand for natural language
processing (NLP) based applications in diverse languages, generative
applications for Indic languages pose unique challenges and opportunities for
research. Our paper examines recent advancements in Indic generative
modeling, contributing a taxonomy of research directions and tabulating 84
recent publications. Research directions surveyed in this paper include LLM
development, fine-tuning existing LLMs, development of corpora, benchmarking
and evaluation, as well as publications around specific techniques, tools, and
applications. We found that researchers across the publications emphasize the
challenges associated with limited data availability, lack of standardization,
and the peculiar linguistic complexities of Indic languages. This work aims to
serve as a valuable resource for researchers and practitioners working in the
field of NLP, particularly those focused on Indic languages, and contributes to
the development of more accurate and efficient LLM applications for these
languages.