Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs

July 10, 2025
Authors: Ziyue Li, Yang Li, Tianyi Zhou
cs.AI

Abstract

Can a pretrained neural network adapt its architecture to different inputs without any finetuning? Do we need all layers for simple tasks, and are they adequate for challenging tasks? We find that the layers of a pretrained large language model (LLM) can be manipulated as separate modules to build a better, and even shallower, model customized for each test sample. In particular, each layer from the pretrained model can be skipped/pruned or repeated multiple times, as in recurrent neural networks (RNNs), and stacked with others in arbitrary order, yielding a chain-of-layers (CoLa) per sample. This compositional space greatly expands the scope of existing work on looped/recurrent pretrained modules, layer pruning, and early-exit networks. We develop a Monte Carlo Tree Search (MCTS) protocol to explore and identify the optimal CoLa for each sample from math and commonsense reasoning benchmarks. Compared to a static model of fixed depth, CoLa allows shortcut paths (fast thinking), recurrence of the same layer(s) (slow thinking), and combinations of both, offering more flexible, dynamic architectures for different inputs. We conduct an extensive analysis of the MCTS-optimized CoLa, which leads to two key findings: (1) for more than 75% of samples that the original LLM predicts correctly, we can find a shorter CoLa, suggesting a large space for improving inference efficiency; (2) for more than 60% of samples that the original LLM predicts incorrectly, we can identify a CoLa that yields the correct prediction, suggesting a large space for performance enhancement. Our results highlight the shortcomings of using a fixed architecture of pretrained LLMs for inference on different samples and pave the way to unlocking the generalization power of test-time depth adaptation.
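
To make the CoLa idea concrete, here is a minimal, self-contained sketch (not the authors' implementation): a toy stack of blocks is applied in whatever order and multiplicity a per-sample chain of layer indices specifies, so skipping, repeating, and reordering all reduce to editing that index list. The toy model, layer type, and example chains are illustrative assumptions; a real LLM's decoder blocks also take attention masks and KV caches, and the MCTS procedure that searches for the chain is omitted.

```python
import torch
import torch.nn as nn

class ToyDeepModel(nn.Module):
    """Stand-in for a pretrained model whose blocks can be reused as separate modules."""
    def __init__(self, num_layers: int = 8, dim: int = 64):
        super().__init__()
        # In a real LLM these would be the pretrained transformer blocks.
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward_cola(self, hidden: torch.Tensor, cola: list) -> torch.Tensor:
        # `cola` is a list of layer indices: an index can be absent (skip/prune),
        # appear once (standard depth), or repeat (recurrence), in any order.
        for idx in cola:
            hidden = torch.relu(self.layers[idx](hidden))
        return hidden

model = ToyDeepModel()
x = torch.randn(1, 64)

default_cola   = list(range(8))                   # original fixed-depth stack
shortcut_cola  = [0, 1, 3, 6, 7]                  # "fast thinking": prune layers 2, 4, 5
recurrent_cola = [0, 1, 2, 2, 2, 3, 4, 5, 6, 7]   # "slow thinking": loop layer 2

for name, cola in [("default", default_cola),
                   ("shortcut", shortcut_cola),
                   ("recurrent", recurrent_cola)]:
    print(name, model.forward_cola(x, cola).shape)
```

In the paper, such chains are not hand-picked as above but discovered per sample by the MCTS protocol searching over this compositional space.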