LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models
June 1, 2026
Author: Prateek Kumar Sikdar
cs.AI
Abstract
Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) rank-8 LoRA adapters on the Q/K/V/O attention projections (~1.08M parameters in total across all blocks). The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable for each input type. After 3,000 training steps (6.4 minutes on an A100 40GB), LayerRoute achieves a skip differential of 12.91 percentage points: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M-parameter backbone). Quality improves over the base model due to LoRA adaptation, with a perplexity delta of -1.29 on tool calls and -1.30 on planning steps.
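The parameter counts quoted in the abstract can be checked with a short calculation. The sketch below is only a sanity check. It assumes the publicly documented Qwen2.5-0.5B attention layout (2 key/value heads of dimension 64, so K and V project from 896 to 128), which the abstract does not state, and it assumes the router's Linear(896,1) includes a bias and the LoRA matrices do not.

```python
# Assumed architecture constants. Hidden size 896 and 24 blocks come from the
# abstract; the grouped-query K/V width is an assumption taken from the public
# Qwen2.5-0.5B configuration.
HIDDEN = 896
NUM_LAYERS = 24
NUM_KV_HEADS = 2   # assumption
HEAD_DIM = 64      # assumption
KV_DIM = NUM_KV_HEADS * HEAD_DIM  # 128
LORA_RANK = 8
BACKBONE_PARAMS = 494e6  # 494M, as stated in the abstract


def lora_params(d_in, d_out, rank=LORA_RANK):
    """Parameters of one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Router: Linear(896, 1) with bias (assumed) has 896 weights plus 1 bias.
router_per_layer = HIDDEN + 1

# LoRA on the four attention projections of one block.
lora_per_layer = (
    lora_params(HIDDEN, HIDDEN)    # Q
    + lora_params(HIDDEN, KV_DIM)  # K
    + lora_params(HIDDEN, KV_DIM)  # V
    + lora_params(HIDDEN, HIDDEN)  # O
)

router_total = NUM_LAYERS * router_per_layer
lora_total = NUM_LAYERS * lora_per_layer
trainable_total = router_total + lora_total
fraction = trainable_total / BACKBONE_PARAMS

# Skip differential in percentage points: tool calls minus planning steps.
skip_diff = 15.25 - 2.34

print(f"router params per layer: {router_per_layer}")
print(f"LoRA params total:       {lora_total:,}")
print(f"trainable params total:  {trainable_total:,}")
print(f"share of backbone:       {100 * fraction:.2f}%")
print(f"skip differential:       {skip_diff:.2f} points")
```

Under these assumptions the totals agree with the rounded figures in the abstract: about 1.08M LoRA parameters, about 1.10M trainable parameters overall, and roughly 0.22% of the backbone.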
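To make the router mechanism concrete, here is a minimal PyTorch sketch of a per-layer hard gate trained with a straight-through estimator. It is an illustration under stated assumptions, not the author's implementation: the class names `HardGateRouter` and `GatedBlock`, the mean-pooling over tokens, the 0.5 threshold, and the way a skipped block falls back to the identity are all hypothetical choices that the abstract does not specify.

```python
import torch
import torch.nn as nn


class HardGateRouter(nn.Module):
    """Hypothetical per-layer router: Linear(hidden, 1) followed by a hard gate."""

    def __init__(self, hidden_size=896):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden):
        # hidden: (batch, seq, hidden). Mean-pooling over tokens is an assumption.
        pooled = hidden.mean(dim=1)
        prob = torch.sigmoid(self.proj(pooled))  # (batch, 1), soft keep-probability
        hard = (prob > 0.5).float()              # binary decision, no gradient
        # Straight-through estimator: the forward value equals `hard`,
        # while the backward pass uses the gradient of `prob`.
        gate = hard + (prob - prob.detach())
        return gate, prob


class GatedBlock(nn.Module):
    """Wraps a block so that gate = 0 passes the input through unchanged."""

    def __init__(self, block, hidden_size=896):
        super().__init__()
        self.block = block
        self.router = HardGateRouter(hidden_size)

    def forward(self, hidden):
        gate, prob = self.router(hidden)
        g = gate.unsqueeze(-1)  # (batch, 1, 1) so it broadcasts over seq and hidden
        # Training-time form: the block is always evaluated so gradients reach
        # the router. Real savings need an inference path that branches on the gate.
        out = g * self.block(hidden) + (1.0 - g) * hidden
        return out, prob


if __name__ == "__main__":
    torch.manual_seed(0)
    # A small Linear layer stands in for a transformer block in this demo.
    layer = GatedBlock(nn.Linear(16, 16), hidden_size=16)
    x = torch.randn(4, 5, 16)
    out, prob = layer(x)
    out.sum().backward()
    print(out.shape, layer.router.proj.weight.grad is not None)
```

The returned `prob` is exposed so that a gate regularisation term could be added to the loss. The exact form of that term is not given in the abstract, so none is shown here.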