Void in Language Models
May 20, 2025
Author: Mani Shemiranifar
cs.AI
Abstract
Despite advances in transformer-based language models (LMs), a fundamental
question remains largely unanswered: Are all layers activated during inference?
We investigate this question by detecting unactivated layers (which we refer to
as Voids) using a non-trainable and parameter-free adaptive computation method
called L2 Adaptive Computation (LAC). We adapt LAC from its original
efficiency-focused application to trace activated layers during inference. This
method monitors changes in the L2-norm of activations to identify voids. We
analyze layer activation in instruction-tuned LMs across two phases: Prompt
Processing (PP), where we trace activated layers for each token in the input
prompts, and Response Generation (RG), where we trace activated layers for each
generated token. We further demonstrate that distinct layers are activated
during these two phases. To show the effectiveness of our method, we evaluate
three distinct instruction-tuned LMs from the Llama, Mistral, and Qwen families
on three benchmarks: MMLU, GPQA Diamond, and BoolQ. For example, on MMLU in a
zero-shot setting, skipping voids in Qwen2.5-7B-Instruct improves its score from
69.24 to 71.29 while using only 30% of the layers. Similarly, on GPQA Diamond,
Mistral-7B-Instruct-v0.3 improves from 13.88 to 18.36 when using 70% of the
layers during both the PP and RG phases. These
results show that not all layers contribute equally during inference, and that
selectively skipping most of them can improve the performance of models on
certain tasks.
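
The abstract describes LAC as monitoring changes in the L2-norm of activations to decide which layers are "voids." The following is a minimal, hypothetical sketch of that idea, not the authors' implementation: it stacks the per-layer hidden states of a Hugging Face causal LM, computes the relative change in each token's L2-norm across consecutive layers, and flags layers whose contribution falls below a threshold. The model name, the relative-change criterion, and the 1e-2 threshold are illustrative assumptions.

```python
# Illustrative sketch of L2-norm-based void detection (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # any causal LM that exposes hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def find_voids(prompt: str, threshold: float = 1e-2):
    """For each input token, flag layers whose relative L2-norm change is
    below `threshold` (the threshold value is an illustrative assumption)."""
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, dim]
    hs = torch.stack(out.hidden_states)           # [L+1, 1, seq, dim]
    norms = hs.norm(dim=-1).squeeze(1)            # [L+1, seq] per-token L2-norms
    rel_change = (norms[1:] - norms[:-1]).abs() / norms[:-1]  # [L, seq]
    return rel_change < threshold                 # True = candidate "void" layer

voids = find_voids("What is the capital of France?")
print(voids.sum(dim=0))  # number of flagged layers per input token
```

In this sketch the flags could then be used to skip the flagged layers on a second pass; how the paper selects and skips voids during the PP and RG phases is described in the full text, not here.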