Skip a Layer or Loop It? Learning Programs of Layers in Large Language Models

Abstract

Large language models (LLMs) perform inference by executing all layers once, in a fixed order and at a fixed depth, with no recurrence. We show that training-free, flexible, dynamic programs of layers (PoLar) exist widely: pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter programs achieve the same or better accuracy, and incorrect predictions of the original LLM can be corrected by alternative programs that use fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To realize PoLar efficiently in practice, we propose a lightweight PoLar prediction network that learns to generate, for each input, an execution program that dynamically skips or repeats pretrained layers. Experiments on mathematical reasoning benchmarks show that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.
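
To make the skip-or-loop idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an execution program can be written as a sequence of layer indices, where an omitted index means the layer is skipped and a repeated index means it is looped. The name `run_program` and the toy scalar "layers" are hypothetical stand-ins for pretrained transformer layers, and the PoLar prediction network that would choose the program for each input is not modeled here.

```python
from typing import Callable, Sequence

# Hypothetical stand-in: a "layer" maps a hidden state to a hidden state.
# A scalar replaces the real hidden-state tensor to keep the sketch small.
Layer = Callable[[float], float]


def run_program(layers: Sequence[Layer], program: Sequence[int], x: float) -> float:
    """Apply layers in the order listed in `program`.

    An index left out of `program` skips that layer;
    an index that appears more than once loops that layer.
    """
    for idx in program:
        x = layers[idx](x)
    return x


# Four toy layers (assumed for illustration only).
layers = [
    lambda h: h + 1.0,
    lambda h: h * 2.0,
    lambda h: h - 3.0,
    lambda h: h * h,
]

standard_program = list(range(len(layers)))  # fixed-depth pass: 0, 1, 2, 3
skip_program = [0, 1, 3]                     # layer 2 is skipped
loop_program = [0, 1, 1, 3]                  # layer 1 runs twice, layer 2 is skipped

standard_out = run_program(layers, standard_program, 1.0)
skip_out = run_program(layers, skip_program, 1.0)
loop_out = run_program(layers, loop_program, 1.0)
```

In this representation the standard forward pass is just one program among many, which is the sense in which a per-input predictor could select a shorter or partially repeated path through the same pretrained layers.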