Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

Abstract

Large language models (LLMs) perform inference through a fixed-depth, fixed-order, non-recurrent execution of all layers. We reveal the widespread existence of training-free, flexible, dynamic programs of layers (PoLar), in which pretrained layers are packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter programs achieve the same or better accuracy, and incorrect predictions of the original LLM can be corrected by alternative programs that use fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To realize PoLar efficiently in practice, we propose a lightweight PoLar prediction network that learns to generate, for each input, an execution program that dynamically skips or repeats pretrained layers. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.
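To make the notion of a program of layers concrete, the following is a minimal sketch and not the authors' implementation. It assumes a program can be written as a sequence of layer indices, where an omitted index is a skip and a repeated index is a loop. The names `make_toy_layer` and `run_program` are hypothetical, the layers are toy residual updates standing in for pretrained transformer layers, and the prediction network that would choose the program for each input is not modeled.

```python
from typing import Callable, List, Sequence

# A "layer" here is any function mapping a hidden state to a hidden state.
Layer = Callable[[List[float]], List[float]]


def make_toy_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for a pretrained layer: h -> h + (scale * h + shift)."""
    def layer(h: List[float]) -> List[float]:
        return [x + scale * x + shift for x in h]
    return layer


def run_program(layers: Sequence[Layer], program: Sequence[int], h: List[float]) -> List[float]:
    """Apply layers in the order listed in `program`.

    An index missing from `program` means that layer is skipped;
    an index appearing more than once means that layer is looped.
    """
    for idx in program:
        h = layers[idx](h)
    return h


# Four toy layers and one toy hidden state (illustrative values only).
layers = [make_toy_layer(0.1 * (i + 1), 0.01 * i) for i in range(4)]
h0 = [1.0, -2.0]

standard = run_program(layers, [0, 1, 2, 3], h0)  # fixed-depth forward pass
skipped = run_program(layers, [0, 3], h0)         # skips layers 1 and 2
looped = run_program(layers, [0, 1, 1, 3], h0)    # repeats layer 1, skips layer 2
```

In this representation the standard forward pass is the special case `[0, 1, ..., L-1]`, and the length of the program is the number of layer executions, so a shorter program corresponds directly to less compute.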