GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

Abstract

Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, their performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state of the art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX also demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com.
