Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs

Abstract

Can a pretrained neural network adapt its architecture to different inputs without any finetuning? Do we need all layers for simple tasks, and are they adequate for challenging tasks? We find that the layers of a pretrained large language model (LLM) can be manipulated as separate modules to build a better and even shallower model customized for each test sample. In particular, each layer of the pretrained model can be skipped/pruned, repeated multiple times in the manner of a recurrent neural network (RNN), and stacked with other layers in arbitrary order, yielding a chain-of-layers (CoLa) per sample. This compositional space greatly expands the scope of existing work on looped/recurrent pretrained modules, layer pruning, and early-exit networks. We develop a Monte Carlo Tree Search (MCTS) protocol to explore and identify the optimal CoLa for each sample from math and commonsense reasoning benchmarks. Compared to a static model of fixed depth, CoLa allows shortcut paths (fast thinking), recurrence of the same layer(s) (slow thinking), and combinations of both, offering more flexible, dynamic architectures for different inputs. We conduct an extensive analysis of the MCTS-optimized CoLa, which leads to two key findings: (1) for more than 75% of the samples that the original LLM predicts correctly, we can find a shorter CoLa, suggesting substantial room for improving inference efficiency; (2) for more than 60% of the samples that the original LLM predicts incorrectly, we can identify a CoLa that yields the correct prediction, suggesting substantial room for improving performance. Our results highlight the shortcomings of using a fixed pretrained LLM architecture for inference on different samples and pave the way to unlocking the generalization power of test-time depth adaptation.
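To make the chain-of-layers idea concrete, the sketch below treats a model as an indexed list of layers and a CoLa as a sequence of layer indices, where an omitted index is a skipped layer and a repeated index is a looped layer. It is a minimal illustration rather than the authors' implementation: `make_toy_layer` and `run_chain` are hypothetical names, the "layers" are toy residual updates standing in for pretrained transformer blocks, and the MCTS search over chains is not shown.

```python
from typing import Callable, List, Sequence

# A "layer" maps a hidden state to a new hidden state.
Layer = Callable[[List[float]], List[float]]


def make_toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a pretrained block: a residual update h + scale * h."""
    def layer(hidden: List[float]) -> List[float]:
        return [x + scale * x for x in hidden]
    return layer


def run_chain(layers: Sequence[Layer], chain: Sequence[int], hidden: List[float]) -> List[float]:
    """Apply layers in the order given by `chain`.

    An index missing from `chain` means that layer is skipped;
    an index appearing several times means that layer is looped.
    """
    for idx in chain:
        hidden = layers[idx](hidden)
    return hidden


# Four toy layers with residual scales 0.1, 0.2, 0.3, 0.4 (assumed values).
layers = [make_toy_layer(0.1 * (i + 1)) for i in range(4)]
x = [1.0, 2.0]

full = run_chain(layers, [0, 1, 2, 3], x)            # fixed-depth forward pass
shortcut = run_chain(layers, [0, 3], x)              # skip layers 1 and 2 ("fast thinking")
looped = run_chain(layers, [0, 1, 1, 1, 2, 3], x)    # repeat layer 1 ("slow thinking")
```

Under this representation, the original model is just the chain `[0, 1, ..., L-1]`, and a per-sample search such as the MCTS protocol described above would explore alternative index sequences of varying length.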
