

MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

July 15, 2024
Authors: Chengguang Gan, Qingyu Yin, Xinyang He, Hanjun Wei, Yunhao Liang, Younghun Lim, Shijian Wang, Hexiang Huang, Qinghao Zhang, Shiwen Ni, Tatsunori Mori
cs.AI

Abstract

The Mutual Reinforcement Effect (MRE) represents a promising avenue in information extraction and multitasking research. Nevertheless, its applicability has been constrained due to the exclusive availability of MRE mix datasets in Japanese, thereby limiting comprehensive exploration by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese. In this paper, we also propose a method for dataset translation assisted by Large Language Models (LLMs), which significantly reduces the manual annotation time required for dataset construction by leveraging LLMs to translate the original Japanese datasets. Additionally, we have enriched the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Utilizing this expanded dataset, we developed a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). The OIELLM model demonstrates the capability to effectively process novel MMM datasets, exhibiting significant improvements in performance.
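
The abstract names two technical ideas: LLM-assisted translation that preserves existing annotations, and a unified input-output format that lets one generative model serve several extraction tasks. The following is a minimal, hypothetical Python sketch of both; the `MRESample` schema, prompt wording, and serialization are illustrative assumptions, not the paper's actual prompt or format.

```python
# Hypothetical sketch: a label-preserving translation prompt and a unified
# input-output training record. Schema and wording are assumptions for
# illustration, not the paper's actual format.
from dataclasses import dataclass

@dataclass
class MRESample:
    """One sample: a sentence, its classification label, and its NER spans."""
    text: str                        # source sentence (Japanese in the original data)
    sentence_label: str              # sentence-classification label
    entities: list[tuple[str, str]]  # (surface form, entity type) pairs

def translation_prompt(sample: MRESample, target_lang: str) -> str:
    """Ask an LLM to translate the sentence and each labeled entity together,
    so annotations carry over without manual re-labeling."""
    entity_list = "\n".join(f"- {surface} ({etype})" for surface, etype in sample.entities)
    return (
        f"Translate the following Japanese sentence into {target_lang}, then "
        f"give the {target_lang} form of each labeled entity with its type kept.\n\n"
        f"Sentence: {sample.text}\n"
        f"Labeled entities:\n{entity_list}"
    )

def to_unified_io(sample: MRESample, task: str) -> dict[str, str]:
    """Serialize classification and NER into one text-to-text pair, so a
    single generative model (like OIELLM) can be trained on every task."""
    output = sample.sentence_label + "; " + "; ".join(
        f"{surface}:{etype}" for surface, etype in sample.entities
    )
    return {"input": f"[{task}] {sample.text}", "output": output}

if __name__ == "__main__":
    sample = MRESample(
        text="アップルは東京に新しい店舗を開いた。",
        sentence_label="business",
        entities=[("アップル", "organization"), ("東京", "location")],
    )
    print(translation_prompt(sample, "English"))
    print(to_unified_io(sample, "open-domain NER + sentence classification"))
```

Folding both tasks into one input-output pair is what lets sentence classification and open-domain NER reinforce each other inside a single generative model, which is the premise of the MRE setup.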
