Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs
July 10, 2025
Authors: Ziyue Li, Yang Li, Tianyi Zhou
cs.AI
Abstract
Can a pretrained neural network adapt its architecture to different inputs
without any finetuning? Do we need all layers for simple tasks, and are they
adequate for challenging tasks? We find that the layers of a pretrained large
language model (LLM) can be manipulated as separate modules to build a better
and even shallower model customized for each test sample. In particular, each
layer from the pretrained model can be skipped/pruned or repeated multiple
times, as in recurrent neural networks (RNNs), and stacked with others in
arbitrary orders, yielding a chain-of-layers (CoLa) per sample. This
compositional space greatly expands the scope of existing work on
looped/recurrent pretrained modules, layer pruning, and early-exit networks.
We develop a Monte Carlo Tree Search (MCTS) protocol to explore and identify
the optimal CoLa for each sample from math and commonsense reasoning
benchmarks. Compared to a static model of fixed depth, CoLa allows shortcut
paths (fast thinking), recurrence of the same layer(s) (slow thinking), and
combinations of both, offering more flexible, dynamic architectures for
different inputs. Our extensive analysis of the MCTS-optimized CoLa yields
two key findings: (1) for more than 75% of the samples that the original LLM
predicts correctly, we can find a shorter CoLa, suggesting large headroom for
improving inference efficiency; (2) for more than 60% of the samples that the
original LLM predicts incorrectly, we can identify a CoLa that yields the
correct prediction, suggesting large headroom for performance gains. Our
results highlight the shortcomings of using a fixed architecture of pretrained
LLMs for inference on different samples and pave the way to unlocking the
generalization power of test-time depth adaptation.
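To make the chain-of-layers idea concrete, here is a minimal sketch (not the authors' code) of executing a per-sample CoLa over a stack of layer modules: a CoLa is simply a sequence of layer indices, where skipped layers are absent, looped layers appear more than once, and indices may occur in any order. The toy arithmetic "layers" stand in for transformer blocks.

```python
def run_cola(layers, cola, x):
    """Apply the layers named by `cola` (a list of layer indices) to input x.

    A hypothetical illustration: in the paper, each element of `layers`
    would be a frozen transformer block of a pretrained LLM, and `cola`
    is the per-sample path found by MCTS.
    """
    for i in cola:
        x = layers[i](x)
    return x

# Toy stand-ins for pretrained layer modules.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

full_depth = run_cola(layers, [0, 1, 2], 0)  # original fixed-depth forward pass
shortcut = run_cola(layers, [1], 0)          # skip layers 0 and 2 (fast thinking)
looped = run_cola(layers, [0, 0, 1], 0)      # repeat layer 0 (slow thinking)
```

MCTS would then search over such index sequences per test sample, scoring each candidate path by prediction correctness or efficiency.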