Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
February 16, 2025
Authors: Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
cs.AI
Abstract
Massive high-quality data, both pre-training raw texts and post-training
annotations, have been carefully prepared to incubate advanced large language
models (LLMs). In contrast, for information extraction (IE), pre-training data,
such as BIO-tagged sequences, are hard to scale up. We show that IE models can
act as free riders on LLM resources by reframing next-token prediction
into extraction for tokens already present in the context. Specifically,
our proposed next tokens extraction (NTE) paradigm learns a versatile IE model,
Cuckoo, on 102.6M extractive instances converted from LLMs' pre-training
and post-training data. Under the few-shot setting, Cuckoo adapts effectively
to traditional and complex instruction-following IE with better performance
than existing pre-trained IE models. As a free rider, Cuckoo can naturally
evolve with the ongoing advancements in LLM data preparation, benefiting from
improvements in LLM training pipelines without additional manual effort.
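To make the NTE reframing concrete, the sketch below illustrates one way a next-token-prediction instance whose continuation already appears verbatim in the context could be recast as a BIO-tagged extraction instance. This is a minimal illustration under that assumption, not the authors' released pipeline; the function name `to_nte_instance` and the exact tagging details are hypothetical.

```python
# A minimal sketch of the next tokens extraction (NTE) idea: if the
# continuation of a sequence already appears verbatim in the context,
# convert the next-token-prediction instance into a BIO-tagged
# extraction instance. Illustrative only; to_nte_instance is a
# hypothetical name, not the authors' released code.
from typing import List, Optional, Tuple

def to_nte_instance(
    context: List[str], next_tokens: List[str]
) -> Optional[Tuple[List[str], List[str]]]:
    """Recast a next-token-prediction instance as BIO-tagged extraction.

    Tags each span of `context` matching `next_tokens` with B/I labels
    and everything else with O; returns None if the continuation does
    not occur in the context (such instances would be skipped).
    """
    n, m = len(context), len(next_tokens)
    tags = ["O"] * n
    found = False
    for i in range(n - m + 1):
        if context[i : i + m] == next_tokens:
            tags[i] = "B"
            tags[i + 1 : i + m] = ["I"] * (m - 1)
            found = True
    return (context, tags) if found else None

# The continuation "Paris" already occurs in the context, so the
# instance becomes an extraction example instead of plain prediction.
context = "The capital of France is Paris . The city of".split()
print(to_nte_instance(context, ["Paris"]))
# -> (context, ['O', 'O', 'O', 'O', 'O', 'B', 'O', 'O', 'O', 'O'])
```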