Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest

February 16, 2025
Authors: Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
cs.AI

Abstract

Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM's pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
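
To make the reframing concrete, below is a minimal Python sketch of the conversion the abstract describes: whenever the upcoming token span already occurs in the preceding context, the next-token-prediction instance is recast as an extraction instance by BIO-tagging the span's occurrences in the context. The function name convert_to_nte and the exact tagging rules are illustrative assumptions, not the authors' released pipeline, which must additionally handle real tokenization and instruction-following annotations.

    # Sketch of the next tokens extraction (NTE) idea, assuming
    # whitespace-tokenized text; convert_to_nte is a hypothetical name.

    def convert_to_nte(tokens, span_len=1):
        """Emit (context, BIO tags) pairs for upcoming spans found in context.

        For each position i, the span tokens[i:i+span_len] becomes an
        extraction target if it already occurs in the context tokens[:i];
        each occurrence is tagged B (begin) / I (inside), all other
        context tokens O (outside).
        """
        examples = []
        for i in range(1, len(tokens) - span_len + 1):
            context, span = tokens[:i], tokens[i:i + span_len]
            tags = ["O"] * len(context)
            found = False
            for j in range(len(context) - span_len + 1):
                if context[j:j + span_len] == span:
                    tags[j] = "B"
                    tags[j + 1:j + span_len] = ["I"] * (span_len - 1)
                    found = True
            if found:
                examples.append((context, tags))
        return examples

    # Example: predicting the span "Tom Hanks" after the comma is recast
    # as extracting its earlier mention from the context.
    tokens = "Tom Hanks starred in the film , and Tom Hanks produced it".split()
    for context, tags in convert_to_nte(tokens, span_len=2):
        print(list(zip(context, tags)))

Applied to LLM pre-training and post-training data, this kind of reframing is what yields the 102.6M extraction instances mentioned above.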
