RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval
November 7, 2024
Authors: Aniket Deroy, Subhankar Maity
cs.AI
Abstract
Code-mixing, the integration of lexical and grammatical elements from
multiple languages within a single sentence, is a widespread linguistic
phenomenon, particularly prevalent in multilingual societies. In India, social
media users frequently engage in code-mixed conversations using the Roman
script, especially among migrant communities who form online groups to share
relevant local information. This paper focuses on the challenges of extracting
relevant information from code-mixed conversations, specifically those in
Roman-transliterated Bengali mixed with English. This study presents a novel approach
to address these challenges by developing a mechanism to automatically identify
the most relevant answers from code-mixed conversations. We experimented
with a dataset comprising queries and documents from Facebook, together with
Query Relevance files (QRels), to aid in this task. Our results demonstrate the
effectiveness of our approach in extracting pertinent information from complex,
code-mixed digital conversations, contributing to the broader field of natural
language processing in multilingual and informal text environments. We prompt
GPT-3.5 Turbo and use the sequential nature of relevant documents to frame a
mathematical model that helps detect the documents relevant to a query.