GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction
May 31, 2025
Authors: Neil De La Fuente, Oscar Sainz, Iker García-Ferrero, Eneko Agirre
cs.AI
Abstract
Information Extraction (IE) systems are traditionally domain-specific,
requiring costly adaptation that involves expert schema design, data
annotation, and model training. While Large Language Models have shown promise
in zero-shot IE, performance degrades significantly in unseen domains where
label definitions differ. This paper introduces GUIDEX, a novel method that
automatically defines domain-specific schemas, infers guidelines, and generates
synthetically labeled instances, allowing for better out-of-domain
generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art
across seven zero-shot Named Entity Recognition benchmarks. Models trained with
GUIDEX gain up to 7 F1 points over previous methods without human-labeled data,
and nearly 2 F1 points when combined with it. Models trained on GUIDEX
demonstrate enhanced comprehension of complex, domain-specific annotation
schemas. Code, models, and synthetic datasets are available at
neilus03.github.io/guidex.com