GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

May 31, 2025
Authors: Neil De La Fuente, Oscar Sainz, Iker García-Ferrero, Eneko Agirre
cs.AI

Abstract

Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state of the art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX also demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com
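
The gains above are reported in F1 points. As a reading aid, here is a minimal sketch of exact-match, entity-level F1 on a 0-100 scale, which is a common way to score NER. The abstract does not describe the evaluation protocol, so the span format `(start, end, label)`, the function name `entity_f1`, and the toy data are all assumptions for illustration, not the authors' scorer.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start_token, end_token, label).
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Exact-match entity-level F1 on a 0-100 scale (illustrative only)."""
    # A prediction counts only if both the span and the label match exactly.
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy data, invented for this example.
gold = {(0, 1, "PERSON"), (5, 6, "ORG"), (9, 9, "LOC"), (12, 13, "DATE")}
baseline = {(0, 1, "PERSON"), (5, 6, "LOC"), (9, 9, "LOC")}  # one wrong label
improved = {(0, 1, "PERSON"), (5, 6, "ORG"), (9, 9, "LOC")}  # label corrected

gain = entity_f1(gold, improved) - entity_f1(gold, baseline)
print(f"F1 gain: {gain:.2f} points")
```

In this sketch, a difference of one "F1 point" is a difference of 1.0 on the 0-100 scale. A label mismatch on a correctly located span still counts as an error, which is why differing label definitions across domains can lower scores.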
