Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest

February 16, 2025
Authors: Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
cs.AI

Abstract

Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM's pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
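
To make the reframing concrete, below is a minimal Python sketch of the conversion the abstract describes: whenever the upcoming token span already occurs in the preceding context, the next-token-prediction instance is recast as an extraction instance by BIO-tagging the span's occurrences in the context. The function name convert_to_nte and the exact tagging rules are illustrative assumptions, not the authors' released pipeline, which must additionally handle real tokenization and instruction-following annotations.

    # Sketch of the next tokens extraction (NTE) idea, assuming
    # whitespace-tokenized text; convert_to_nte is a hypothetical name.

    def convert_to_nte(tokens, span_len=1):
        """Emit (context, BIO tags) pairs for upcoming spans found in context.

        For each position i, the span tokens[i:i+span_len] becomes an
        extraction target if it already occurs in the context tokens[:i];
        each occurrence is tagged B (begin) / I (inside), all other
        context tokens O (outside).
        """
        examples = []
        for i in range(1, len(tokens) - span_len + 1):
            context, span = tokens[:i], tokens[i:i + span_len]
            tags = ["O"] * len(context)
            found = False
            for j in range(len(context) - span_len + 1):
                if context[j:j + span_len] == span:
                    tags[j] = "B"
                    tags[j + 1:j + span_len] = ["I"] * (span_len - 1)
                    found = True
            if found:
                examples.append((context, tags))
        return examples

    # Example: predicting the span "Tom Hanks" after the comma is recast
    # as extracting its earlier mention from the context.
    tokens = "Tom Hanks starred in the film , and Tom Hanks produced it".split()
    for context, tags in convert_to_nte(tokens, span_len=2):
        print(list(zip(context, tags)))

Applied to LLM pre-training and post-training data, this kind of reframing is what yields the 102.6M extraction instances mentioned above.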
