LayerRoute: Input-Conditional Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

Abstract

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems spend the same compute on every step. We introduce LayerRoute, a lightweight adapter that learns to skip transformer blocks selectively on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with (1) a router (Linear(896,1), 897 parameters per layer) that outputs a hard binary gate trained with the straight-through estimator, and (2) rank-8 LoRA adapters on the Q/K/V/O attention projections (~1.08M parameters in total). The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks can be skipped for each input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a skip differential of 12.91 percentage points: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M-parameter backbone). Quality improves over the base model due to LoRA adaptation, with a perplexity change of -1.29 on tool calls and -1.30 on planning steps.
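The parameter budget and the gating rule described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The hidden size, layer count, LoRA rank, and backbone size come from the abstract; the key/value projection width of 128 is an assumption taken from the public Qwen2.5-0.5B configuration (2 KV heads with head dimension 64). The router input (a single pooled hidden vector), the 0.5 threshold, and the sigmoid-derivative surrogate gradient are likewise assumptions, since the abstract does not specify them; `hard_gate` and `gated_block` are hypothetical names.

```python
import math

# Part 1: trainable-parameter accounting
HIDDEN = 896             # from the abstract: router is Linear(896, 1)
NUM_LAYERS = 24          # from the abstract
LORA_RANK = 8            # from the abstract
BACKBONE = 494_000_000   # from the abstract
KV_DIM = 128             # ASSUMPTION: 2 KV heads x head dim 64 (public config)


def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank), with no bias."""
    return rank * (d_in + d_out)


router_per_layer = HIDDEN + 1  # 896 weights + 1 bias
lora_per_layer = (
    lora_params(HIDDEN, HIDDEN, LORA_RANK)    # Q projection
    + lora_params(HIDDEN, KV_DIM, LORA_RANK)  # K projection
    + lora_params(HIDDEN, KV_DIM, LORA_RANK)  # V projection
    + lora_params(HIDDEN, HIDDEN, LORA_RANK)  # O projection
)
router_total = NUM_LAYERS * router_per_layer
lora_total = NUM_LAYERS * lora_per_layer
trainable = router_total + lora_total
trainable_pct = 100.0 * trainable / BACKBONE

# Skip differential reported in the abstract, in percentage points.
skip_differential = 15.25 - 2.34


# Part 2: hard gate with a straight-through surrogate gradient
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hard_gate(x, w, b):
    """Return (gate, surrogate_grad) for one router.

    Forward: a hard 0/1 decision obtained by thresholding sigmoid(w.x + b).
    Backward (straight-through, one common variant): the gradient of the
    gate with respect to the logit is taken to be that of the sigmoid,
    as if the threshold were not there.
    """
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(logit)
    gate = 1.0 if p > 0.5 else 0.0
    surrogate_grad = p * (1.0 - p)
    return gate, surrogate_grad


def gated_block(x, block, w, b):
    """Apply a residual block only when the router opens its gate."""
    gate, _ = hard_gate(x, w, b)
    if gate == 0.0:
        return list(x)  # block skipped: the residual stream passes through
    return [xi + fi for xi, fi in zip(x, block(x))]


if __name__ == "__main__":
    print("router params:", router_total)
    print("LoRA params:", lora_total)
    print("trainable params:", trainable)
    print("share of backbone (%):", round(trainable_pct, 2))
```

Under these assumptions the counts round to the ~1.08M LoRA parameters and 1.10M total trainable parameters quoted in the abstract. In a real model the compute saving comes from not evaluating `block` at all when the gate is closed, which is why the sketch branches instead of multiplying the block output by zero.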