Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
January 6, 2026
Authors: Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
cs.AI
Abstract
Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework that integrates Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs that provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates the acquisition of that competence, while maintaining competitive performance on general reasoning tasks.
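The abstract describes two ingredients: turning raw text into structured input-output pairs for language learning tasks, and mixing those pairs with ordinary raw text used for next-token prediction. The concrete tasks and mixing recipe are defined in the paper; the sketch below is only a minimal illustration of that general idea, assuming a hypothetical cloze-style task, and the names `make_l2t_example`, `build_pretraining_mixture`, and the `l2t_ratio` value are illustrative, not taken from the paper.

```python
import random


def make_l2t_example(sentence: str, rng: random.Random) -> dict:
    """Turn a raw sentence into a structured input-output pair.

    The actual language learning tasks used by L2T are specified in the
    paper; here we assume a simple cloze-style task (mask one word and
    ask the model to recover it) purely for illustration.
    """
    tokens = sentence.split()
    idx = rng.randrange(len(tokens))          # pick a token to mask
    target = tokens[idx]
    masked = tokens.copy()
    masked[idx] = "<blank>"
    return {
        "input": "Fill in the blank: " + " ".join(masked),
        "output": target,
    }


def build_pretraining_mixture(raw_sentences, l2t_ratio=0.3, seed=0):
    """Mix raw text (next-token prediction) with L2T input-output pairs.

    `l2t_ratio` is a hypothetical mixing weight, not a value reported in
    the paper.
    """
    rng = random.Random(seed)
    mixture = []
    for sent in raw_sentences:
        if rng.random() < l2t_ratio:
            ex = make_l2t_example(sent, rng)
            # Serialize the structured pair back into one training sequence.
            mixture.append(f"{ex['input']}\n{ex['output']}")
        else:
            mixture.append(sent)  # plain next-token prediction sample
    return mixture


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat .",
        "Language models learn from raw text .",
    ]
    for line in build_pretraining_mixture(corpus, l2t_ratio=0.5):
        print(line)
        print("---")
```

In this sketch both kinds of samples end up as plain text sequences, so the same next-token prediction objective can be applied to the whole mixture; how L2T actually formats and weights its tasks is described in the paper itself.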