Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
February 11, 2025
Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have been found to struggle with accurately
retrieving key information. To address this, we propose Mask-Enhanced
Autoregressive Prediction (MEAP), a simple yet effective training paradigm that
seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction
(NTP) to enhance the latter's in-context retrieval capabilities. Specifically,
MEAP first randomly masks a small fraction of input tokens and then directly
performs standard autoregressive next-token prediction using a decoder-only
Transformer. MEAP eliminates the need for bidirectional attention or
encoder-decoder architectures for MLM, incurring no additional computational
overhead during pre-training or inference. Extensive experiments demonstrate
that MEAP substantially outperforms NTP on key information retrieval and
long-context reasoning tasks, while performing on par with or better than NTP on commonsense
reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning,
where it shows remarkable advantages in lost-in-the-middle scenarios,
outperforming NTP by 11.77 percentage points. Our analysis indicates that
MEAP's effectiveness arises from its ability to promote more distinguishable
attention scores by concentrating on a reduced set of non-masked tokens. This
mechanism improves the model's focus on task-relevant signals while mitigating
the influence of peripheral context. These findings position MEAP as a
promising training paradigm for large language models.
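To make the recipe concrete, here is a minimal PyTorch sketch of one MEAP-style pre-training step, following the abstract's description: mask a small fraction of input tokens, then run ordinary causal next-token prediction on the corrupted sequence. The `meap_step` helper, the `mask_token_id` argument, the 15% `mask_ratio`, and the use of the original (uncorrupted) tokens as next-token targets are illustrative assumptions, not details confirmed by the paper; the model is assumed to be a Hugging Face-style decoder-only causal LM whose forward pass returns `.logits`.

```python
import torch
import torch.nn.functional as F

def meap_step(model, input_ids, mask_token_id, mask_ratio=0.15):
    """One MEAP-style pre-training step (sketch, not the paper's exact code).

    Randomly masks a small fraction of the input tokens, then performs
    standard autoregressive next-token prediction on the corrupted
    sequence with a decoder-only model. No bidirectional attention or
    encoder is involved.
    """
    corrupted = input_ids.clone()
    # Bernoulli-sample positions to mask (mask_ratio is an assumed value).
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    corrupted[mask] = mask_token_id

    # Standard autoregressive shift: position t predicts token t+1.
    logits = model(corrupted).logits  # (batch, seq_len, vocab), HF-style output
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),  # targets: the clean, unmasked tokens (assumed)
    )
    return loss
```

Because the corruption touches only the input ids, attention stays strictly causal and the architecture is unchanged, which is consistent with the abstract's claim that MEAP adds no computational overhead over plain NTP during pre-training or inference.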