GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction
May 31, 2025
Authors: Neil De La Fuente, Oscar Sainz, Iker García-Ferrero, Eneko Agirre
cs.AI
Abstract
Information Extraction (IE) systems are traditionally domain-specific,
requiring costly adaptation that involves expert schema design, data
annotation, and model training. While Large Language Models have shown promise
in zero-shot IE, performance degrades significantly in unseen domains where
label definitions differ. This paper introduces GUIDEX, a novel method that
automatically defines domain-specific schemas, infers guidelines, and generates
synthetically labeled instances, allowing for better out-of-domain
generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art
across seven zero-shot Named Entity Recognition benchmarks. Models trained with
GUIDEX gain up to 7 F1 points over previous methods without human-labeled data,
and nearly 2 F1 points when combined with it. Models trained on GUIDEX
demonstrate enhanced comprehension of complex, domain-specific annotation
schemas. Code, models, and synthetic datasets are available at
neilus03.github.io/guidex.com
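
The gains above are reported in F1 points. As a minimal sketch of what that unit measures, the snippet below computes entity-level F1 on a 0-100 scale. It assumes exact-match scoring (a predicted entity counts only if its span and label both match a gold entity), which is a common NER convention but is not stated in the abstract. The `Entity` tuple layout, the `entity_f1` helper, and the toy annotations are hypothetical and are not taken from the GUIDEX code or data.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_token, end_token, label).
Entity = Tuple[int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Exact-match entity-level F1 on a 0-100 scale (assumed convention)."""
    if not gold and not pred:
        return 100.0  # nothing to find and nothing predicted
    true_positives = len(gold & pred)  # span and label must both match
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: the second prediction has the right span but the wrong label,
# so it counts as both a false positive and a missed gold entity.
gold = {(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 11, "ORGANIZATION")}
pred = {(0, 2, "PERSON"), (5, 6, "ORGANIZATION"), (9, 11, "ORGANIZATION")}
score = entity_f1(gold, pred)
print(f"F1 = {score:.2f}")
```

Under this convention a label mismatch is penalized twice, which is one reason performance can drop sharply in unseen domains where label definitions differ from those seen in training.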