

MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models

July 15, 2024
Authors: Chengguang Gan, Qingyu Yin, Xinyang He, Hanjun Wei, Yunhao Liang, Younghun Lim, Shijian Wang, Hexiang Huang, Qinghao Zhang, Shiwen Ni, Tatsunori Mori
cs.AI

Abstract

The Mutual Reinforcement Effect (MRE) is a promising avenue in information extraction and multitask research. Its exploration has nevertheless been constrained because MRE mix datasets were available only in Japanese, limiting study by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese. In this paper, we also propose a dataset translation method assisted by Large Language Models (LLMs): by using LLMs to translate the original Japanese datasets, it significantly reduces the manual annotation time required for dataset construction. Additionally, we enrich the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Using this expanded dataset, we develop a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). The OIELLM model effectively handles the new MMM datasets and shows significant improvements in performance.
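To make the "unified input-output framework" concrete, below is a minimal sketch of how a single MRE-style sample, carrying both a sentence-level label and word-level entities, could be flattened into one text-to-text training pair. The class name, field names, and prompt template are illustrative assumptions, not the MMM specification or the authors' released code.

```python
# Minimal sketch of a unified input-output serialization for joint
# sentence classification and open-domain NER. All names and the
# prompt template below are assumptions for illustration only.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MRESample:
    text: str                        # source sentence
    sentence_label: str              # sentence-level classification label
    entities: List[Tuple[str, str]]  # (entity span, entity type) pairs


def to_unified_io(sample: MRESample, task: str = "open-domain NER") -> Tuple[str, str]:
    """Serialize one sample into a single (input, output) text pair,
    so sentence-level and word-level tasks share one seq2seq format."""
    model_input = f"Task: {task}\nText: {sample.text}"
    entity_str = "; ".join(f"{span}: {etype}" for span, etype in sample.entities)
    model_output = f"Label: {sample.sentence_label}\nEntities: {entity_str}"
    return model_input, model_output


# Example: one English sample flattened into a single training pair.
sample = MRESample(
    text="Apple opened a new office in Tokyo.",
    sentence_label="business",
    entities=[("Apple", "organization"), ("Tokyo", "location")],
)
src, tgt = to_unified_io(sample)
print(src)
print(tgt)
```

Collapsing both task outputs into one target string is what lets a single LLM be fine-tuned across all 21 sub-datasets with one format; the actual delimiters and label inventory used by MMM may differ.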
