Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
January 6, 2026
Authors: Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
cs.AI
Abstract
Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates the acquisition of linguistic competence, while maintaining competitive performance on general reasoning tasks.
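To illustrate the general idea described in the abstract, the sketch below shows one hypothetical way raw text could be converted into structured input-output pairs and interleaved with ordinary raw-text data for next-token-prediction pre-training. The cloze-style task, function names, and mixing ratio are assumptions for illustration only; the paper's actual L2T task formats and data mixture are not specified here.

```python
import random

def make_l2t_example(sentence: str, rng: random.Random) -> str:
    """Turn a raw sentence into a cloze-style input-output pair.

    This is an assumed example task, not necessarily one used by L2T.
    The structured pair is serialized into a single text sequence so it
    can be consumed by standard next-token prediction.
    """
    words = sentence.split()
    idx = rng.randrange(len(words))
    target = words[idx]
    words[idx] = "____"
    return f"Input: {' '.join(words)}\nOutput: {target}"

def mixed_stream(raw_sentences, l2t_fraction=0.3, seed=0):
    """Yield a mixture of raw text and L2T-style training sequences.

    l2t_fraction is an assumed mixing ratio, not a value from the paper.
    """
    rng = random.Random(seed)
    for sentence in raw_sentences:
        if rng.random() < l2t_fraction:
            yield make_l2t_example(sentence, rng)
        else:
            yield sentence  # plain raw text for next-token prediction

if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat .",
        "She gave him a book yesterday .",
    ]
    for seq in mixed_stream(corpus):
        print(seq, end="\n\n")
```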