LayerRoute: Input-Conditioned Adaptive Layer Skipping with LoRA Fine-Tuning for Agentic Language Models

Abstract

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to skip transformer blocks selectively on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with (1) a per-layer router (Linear(896,1), 897 parameters per block) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters in total) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term drives the system to discover which blocks are skippable for each input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a skip differential of 12.91 percentage points: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model owing to LoRA adaptation, with perplexity deltas of -1.29 on tool calls and -1.30 on planning steps.
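The parameter budget and skip differential quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes each router is a Linear(896,1) with a bias term, that each LoRA adapter consists of two bias-free low-rank matrices, and that the K and V projections have output width 128 (2 key/value heads of dimension 64 under grouped-query attention). That last figure is taken from the public Qwen2.5-0.5B configuration, not from the abstract. All variable and function names are hypothetical.

```python
# Parameter accounting for the figures quoted in the abstract.
HIDDEN = 896             # from the abstract: Linear(896, 1)
NUM_BLOCKS = 24          # from the abstract
LORA_RANK = 8            # from the abstract
BACKBONE = 494_000_000   # from the abstract
KV_DIM = 128             # ASSUMPTION: 2 KV heads x head_dim 64 (not stated in the abstract)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Router: one weight per hidden unit plus a bias (ASSUMPTION: bias is enabled).
router_per_block = HIDDEN + 1
router_total = NUM_BLOCKS * router_per_block

# LoRA on the four attention projections of one block.
lora_per_block = (
    lora_params(HIDDEN, HIDDEN, LORA_RANK)    # Q
    + lora_params(HIDDEN, KV_DIM, LORA_RANK)  # K
    + lora_params(HIDDEN, KV_DIM, LORA_RANK)  # V
    + lora_params(HIDDEN, HIDDEN, LORA_RANK)  # O
)
lora_total = NUM_BLOCKS * lora_per_block

trainable_total = router_total + lora_total
trainable_fraction = trainable_total / BACKBONE

# Skip differential in percentage points: tool-call skip rate minus planning skip rate.
skip_differential = round(15.25 - 2.34, 2)

print(f"router parameters:    {router_total:,}")
print(f"LoRA parameters:      {lora_total:,}")
print(f"trainable parameters: {trainable_total:,} ({trainable_fraction:.2%} of backbone)")
print(f"skip differential:    {skip_differential} percentage points")
```

Under these assumptions the routers contribute about 21.5K parameters and the LoRA adapters about 1.08M, which together round to the reported 1.10M, or 0.22% of the backbone.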