
Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

January 6, 2026
Authors: Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
cs.AI

Abstract

Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
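The abstract does not spell out the concrete L2T task formats, so the snippet below is only an illustrative sketch of the general idea of recasting raw text as a structured input-output pair; the function `make_cloze_pair` and the fill-in-the-blank format are assumptions for illustration, not the authors' actual method.

```python
import random


def make_cloze_pair(sentence: str, seed: int = 0) -> dict:
    """Turn a raw sentence into a toy fill-in-the-blank input-output pair.

    Hypothetical example only: it mimics the notion of converting raw text
    into an explicit language-learning exercise, not the L2T tasks themselves.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    # Blank out one word to serve as the target of the exercise.
    idx = rng.randrange(len(tokens))
    target = tokens[idx]
    masked = tokens.copy()
    masked[idx] = "____"
    return {
        "input": "Fill in the blank: " + " ".join(masked),
        "output": target,
    }


if __name__ == "__main__":
    pair = make_cloze_pair("Language models are pre-trained on raw text datasets.")
    print(pair["input"])
    print(pair["output"])
```

Pairs like this could in principle be interleaved with ordinary next-token-prediction data during pre-training, which is the mixing strategy the abstract describes at a high level.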