Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
February 11, 2025
Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have been found to struggle with accurately
retrieving key information. To address this, we propose Mask-Enhanced
Autoregressive Prediction (MEAP), a simple yet effective training paradigm that
seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction
(NTP) to enhance the latter's in-context retrieval capabilities. Specifically,
MEAP first randomly masks a small fraction of input tokens and then directly
performs standard autoregressive next-token prediction using a decoder-only
Transformer. MEAP eliminates the need for bidirectional attention or
encoder-decoder architectures for MLM, incurring no additional computational
overhead during pre-training or inference. Extensive experiments demonstrate
that MEAP substantially outperforms NTP on key information retrieval and
long-context reasoning tasks, while performing on par with or better than NTP on commonsense
reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning,
where it shows remarkable advantages in lost-in-the-middle scenarios,
outperforming NTP by 11.77 percentage points. Our analysis indicates that
MEAP's effectiveness arises from its ability to promote more distinguishable
attention scores by concentrating on a reduced set of non-masked tokens. This
mechanism improves the model's focus on task-relevant signals while mitigating
the influence of peripheral context. These findings position MEAP as a
promising training paradigm for large language models.
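To make the recipe concrete, here is a minimal PyTorch sketch of one MEAP-style pre-training step, following the abstract's description: mask a small fraction of input tokens, then run ordinary causal next-token prediction on the corrupted sequence. The `meap_step` helper, the `mask_token_id` argument, the 15% `mask_ratio`, and the use of the original (uncorrupted) tokens as next-token targets are illustrative assumptions, not details confirmed by the paper; the model is assumed to be a Hugging Face-style decoder-only causal LM whose forward pass returns `.logits`.

```python
import torch
import torch.nn.functional as F

def meap_step(model, input_ids, mask_token_id, mask_ratio=0.15):
    """One MEAP-style pre-training step (sketch, not the paper's exact code).

    Randomly masks a small fraction of the input tokens, then performs
    standard autoregressive next-token prediction on the corrupted
    sequence with a decoder-only model. No bidirectional attention or
    encoder is involved.
    """
    corrupted = input_ids.clone()
    # Bernoulli-sample positions to mask (mask_ratio is an assumed value).
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    corrupted[mask] = mask_token_id

    # Standard autoregressive shift: position t predicts token t+1.
    logits = model(corrupted).logits  # (batch, seq_len, vocab), HF-style output
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),  # targets: the clean, unmasked tokens (assumed)
    )
    return loss
```

Because the corruption touches only the input ids, attention stays strictly causal and the architecture is unchanged, which is consistent with the abstract's claim that MEAP adds no computational overhead over plain NTP during pre-training or inference.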