Void in Language Models

Abstract

Despite advances in transformer-based language models (LMs), a fundamental question remains largely unanswered: Are all layers activated during inference? We investigate this question by detecting unactivated layers (which we refer to as Voids) using a non-trainable and parameter-free adaptive computation method called L2 Adaptive Computation (LAC). We adapt LAC from its original efficiency-focused application to trace activated layers during inference. This method monitors changes in the L2-norm of activations to identify voids. We analyze layer activation in instruction-tuned LMs across two phases: Prompt Processing (PP), where we trace activated layers for each token in the input prompts, and Response Generation (RG), where we trace activated layers for each generated token. We further demonstrate that distinct layers are activated during these two phases. To show the effectiveness of our method, we evaluated three distinct instruction-tuned LMs from the Llama, Mistral, and Qwen families on three benchmarks: MMLU, GPQA Diamond, and BoolQ. For example, on MMLU with a zero-shot setting, skipping voids in Qwen2.5-7B-Instruct resulted in an improvement from 69.24 to 71.29 while the model uses only 30% of the layers. Similarly, Mistral-7B-Instruct-v0.3 on GPQA Diamond improved from 13.88 to 18.36 when using 70% of the layers during both the PP and RG phases. These results show that not all layers contribute equally during inference, and that selectively skipping most of them can improve the performance of models on certain tasks.
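
As a rough illustration of what monitoring changes in the L2-norm of activations could look like, the Python sketch below flags a layer as a void for a single token when the relative change in L2 norm across that layer falls below a threshold. The threshold rule, the function names `find_voids` and `layer_usage`, and the toy activations are all assumptions made for illustration; the abstract does not state LAC's exact criterion, so this should not be read as the authors' implementation.

```python
import math
from typing import List, Sequence


def l2_norm(vec: Sequence[float]) -> float:
    """Euclidean (L2) norm of one activation vector."""
    return math.sqrt(sum(x * x for x in vec))


def find_voids(hidden_states: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return 1-based indices of layers flagged as voids for a single token.

    hidden_states[0] is the input to the first layer and hidden_states[i]
    is the output of layer i, so there are len(hidden_states) - 1 layers.

    ASSUMPTION: a layer counts as a void when the relative change in L2 norm
    it produces is below `threshold`. This rule is illustrative only and is
    not claimed to be the exact LAC criterion.
    """
    norms = [l2_norm(h) for h in hidden_states]
    voids = []
    for layer in range(1, len(norms)):
        prev = norms[layer - 1]
        # Guard against division by zero for an all-zero input vector.
        rel_change = abs(norms[layer] - prev) / max(prev, 1e-12)
        if rel_change < threshold:
            voids.append(layer)
    return voids


def layer_usage(voids: Sequence[int], num_layers: int) -> float:
    """Fraction of layers that were not flagged as voids."""
    return (num_layers - len(voids)) / num_layers


# Toy activations for one token passing through 4 hypothetical layers.
states = [
    [3.0, 4.0],   # input to layer 1, norm 5.0
    [6.0, 8.0],   # output of layer 1, norm 10.0 (large change)
    [6.0, 8.0],   # output of layer 2, norm 10.0 (no change)
    [0.0, 10.1],  # output of layer 3, norm 10.1 (about 1% change)
    [0.0, 20.2],  # output of layer 4, norm 20.2 (large change)
]
voids = find_voids(states, threshold=0.05)
usage = layer_usage(voids, num_layers=len(states) - 1)
print(voids, usage)  # prints: [2, 3] 0.5
```

Under these assumptions, the same trace could be collected separately over prompt tokens (PP) and over generated tokens (RG) to compare which layers are flagged in each phase.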
