Skip a Layer or Loop It? Learning Programs of Layers in Large Language Models

Abstract

Large language models (LLMs) perform inference by executing every layer exactly once, in a fixed order and at a fixed depth, with no recurrence. We show that training-free, flexible, dynamic programs of layers (PoLar) are widespread: pretrained layers can be packed into modules and then skipped or looped to form a customized execution program for each input. For most inputs, substantially shorter programs achieve the same or better accuracy, and incorrect predictions of the original LLM can be corrected by alternative programs that use fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To realize PoLar efficiently in practice, we propose a lightweight PoLar prediction network that learns to generate, for each input, an execution program that dynamically skips or repeats pretrained layers. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.
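
To make the idea of an execution program concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a program can be represented as a sequence of layer indices, and it uses toy scalar functions as stand-ins for pretrained layers. The names `run_program`, `layers`, `standard`, `skipped`, and `looped` are hypothetical, and the prediction network that would choose a program for each input is not shown.

```python
from typing import Callable, List, Sequence

# Assumption: a "layer" is any function from a hidden state to a hidden state.
# Here the hidden state is a single float purely for illustration.
Layer = Callable[[float], float]


def run_program(layers: Sequence[Layer], program: Sequence[int], x: float) -> float:
    """Apply layers in the order given by `program` (indices into `layers`).

    An index that is absent from `program` is skipped; an index that
    appears more than once is looped.
    """
    for idx in program:
        x = layers[idx](x)
    return x


# Toy stand-ins for four pretrained layers (hypothetical).
layers: List[Layer] = [
    lambda h: h + 1.0,  # layer 0
    lambda h: h * 2.0,  # layer 1
    lambda h: h - 3.0,  # layer 2
    lambda h: h * h,    # layer 3
]

standard = list(range(len(layers)))  # [0, 1, 2, 3]: the usual forward pass
skipped = [0, 1, 3]                  # skips layer 2
looped = [0, 1, 1, 3]                # repeats layer 1 and skips layer 2

out_standard = run_program(layers, standard, 1.0)
out_skipped = run_program(layers, skipped, 1.0)
out_looped = run_program(layers, looped, 1.0)
```

In this representation the standard forward pass is just the identity program `[0, 1, ..., L-1]`, so skipping and looping are both special cases of choosing a different index sequence over the same frozen layers.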