Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Abstract

Language models (LMs) are pre-trained on raw text datasets to generate text sequences token by token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework that integrates Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates the acquisition of that competence, while maintaining competitive performance on general reasoning tasks.
