Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Abstract

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard next-token prediction autoregressively using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, and incurs no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.
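The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `meap_mask`, the `mask_id` value, the default `mask_ratio` of 0.15, and the choice to keep the original (unmasked) tokens as prediction targets are all assumptions made for illustration. The abstract only states that a small fraction of input tokens is randomly masked before standard next-token prediction with a decoder-only Transformer.

```python
import random


def meap_mask(token_ids, mask_id, mask_ratio=0.15, seed=None):
    """Build (inputs, targets) for next-token prediction on a masked sequence.

    Hypothetical helper: mask_id, mask_ratio, and the target convention
    are illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    n = len(token_ids)
    n_mask = int(n * mask_ratio)

    # Choose which positions to replace with the mask token.
    masked_positions = set(rng.sample(range(n), n_mask))
    corrupted = [
        mask_id if i in masked_positions else tok
        for i, tok in enumerate(token_ids)
    ]

    # Standard causal shift: the input at step t predicts the token at t + 1.
    # Assumption: targets are the original tokens, so the loss is unchanged
    # from plain NTP and only the model's input context is corrupted.
    inputs = corrupted[:-1]
    targets = list(token_ids[1:])
    return inputs, targets, masked_positions


if __name__ == "__main__":
    tokens = list(range(100, 120))  # stand-in token ids
    inputs, targets, positions = meap_mask(tokens, mask_id=0, mask_ratio=0.2, seed=0)
    print("masked positions:", sorted(positions))
    print("inputs: ", inputs)
    print("targets:", targets)
```

Because only the input sequence changes, a sketch like this can sit in front of an unmodified decoder-only training loop, which is consistent with the abstract's claim that no bidirectional attention or encoder-decoder architecture is needed.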
