

Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

January 6, 2026
Authors: Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
cs.AI

Abstract

Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework that integrates Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
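The abstract describes L2T only at a high level: raw text is converted into structured input-output pairs and mixed with ordinary next-token-prediction data. The sketch below illustrates that pipeline under explicit assumptions; the cloze-style task, the `Input:`/`Output:` serialization, and the mixing ratio are hypothetical stand-ins, not the tasks or settings actually used in the paper.

```python
import random

def make_l2t_example(text: str, mask_token: str = "<mask>") -> dict:
    """Turn a raw sentence into a structured input-output pair.

    Hypothetical cloze-style task: mask one word and ask the model to
    recover it. The actual L2T task inventory is defined in the paper,
    not here.
    """
    words = text.split()
    if len(words) < 2:
        return {"input": text, "output": ""}
    idx = random.randrange(len(words))
    target = words[idx]
    words[idx] = mask_token
    return {"input": " ".join(words), "output": target}

def build_pretraining_mixture(raw_texts, l2t_ratio=0.5):
    """Mix plain next-token-prediction text with L2T-style pairs.

    Each document is kept as raw text with probability (1 - l2t_ratio)
    and converted into a task pair otherwise; the ratio is an
    illustrative knob, not a value reported in the paper.
    """
    mixture = []
    for text in raw_texts:
        if random.random() < l2t_ratio:
            pair = make_l2t_example(text)
            # Serialize the pair back into a single training sequence.
            mixture.append(f"Input: {pair['input']}\nOutput: {pair['output']}")
        else:
            mixture.append(text)
    return mixture

if __name__ == "__main__":
    docs = ["The cat sat on the mat.", "Language models learn from raw text."]
    for seq in build_pretraining_mixture(docs, l2t_ratio=0.5):
        print(seq, "\n---")
```

Either sequence type is then consumed by the same next-token-prediction objective, so the language-learning supervision arrives through the data rather than through a separate loss.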