GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction

May 31, 2025
Authors: Neil De La Fuente, Oscar Sainz, Iker García-Ferrero, Eneko Agirre
cs.AI

Abstract

Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, their performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state of the art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX also demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com
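The gains above are stated in F1 points. As a rough illustration only, the sketch below computes a micro-averaged, exact-match entity-level F1, which is a common convention for Named Entity Recognition. The abstract does not specify the scorer used, so the matching rule, the `micro_f1` function, and the toy data are all assumptions made for this example.

```python
from typing import List, Set, Tuple

# Hypothetical entity representation: (start_token, end_token, label).
# Exact-match scoring is an assumption, not taken from the paper.
Entity = Tuple[int, int, str]


def micro_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """Micro-averaged exact-match F1 over a list of sentences."""
    # A prediction counts as correct only if span and label both match.
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        return 0.0  # also covers empty predictions or empty gold sets
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data: one of the three predicted entities carries the wrong label.
gold = [{(0, 1, "PER"), (4, 4, "LOC")}, {(2, 3, "ORG")}]
pred = [{(0, 1, "PER"), (4, 4, "ORG")}, {(2, 3, "ORG")}]
score = micro_f1(gold, pred)
```

Under this convention, a score reported in "F1 points" is the value above multiplied by 100, so a 7-point gain corresponds to a 0.07 increase in `score`.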
June 9, 2025