GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

Abstract

Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, their performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state of the art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX also demonstrate a better understanding of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com.
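The gains above are reported in F1 points. As a rough illustration of what that unit measures, the following sketch computes entity-level micro-averaged F1 for Named Entity Recognition. It assumes exact-match scoring over (start, end, label) spans, which is a common convention but is not confirmed by the abstract; the function name `micro_f1` and the toy spans and labels are hypothetical.

```python
from typing import List, Set, Tuple

# Assumption: an entity is (start, end, label), and a prediction counts
# as correct only if all three fields match a gold entity exactly.
Entity = Tuple[int, int, str]


def micro_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """Entity-level micro-averaged F1 over a list of sentences."""
    # True positives: predicted entities that also appear in the gold set.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): the second entity of the first sentence has the
# right span but the wrong label, so it is both a miss and a false alarm.
gold = [{(0, 2, "PERSON"), (5, 6, "LOCATION")}, {(1, 3, "ORGANIZATION")}]
pred = [{(0, 2, "PERSON"), (5, 6, "ORGANIZATION")}, {(1, 3, "ORGANIZATION")}]
score = micro_f1(gold, pred)
```

Here two of three predictions are correct and two of three gold entities are found, so precision and recall are both 2/3 and the score is 2/3. Scores are usually multiplied by 100 when reported, so under this reading a gain of 7 F1 points corresponds to a 0.07 increase in this value.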
