LayerRoute: Input-Conditional Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

Abstract

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems spend the same amount of compute on every step. We introduce LayerRoute, a lightweight adapter that learns to skip transformer blocks selectively on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (Linear(896,1), ~897 parameters per block) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters in total) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks can be skipped for each input type. After 3,000 training steps (6.4 minutes on an A100 40GB), LayerRoute achieves a skip differential of 12.91 percentage points: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M-parameter backbone). Quality improves over the base model as a result of the LoRA adaptation, with perplexity deltas of -1.29 on tool calls and -1.30 on planning steps.
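The parameter budget and the skip differential quoted above can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not the authors' code. It assumes that each router is a Linear(896,1) layer with a bias term, that a rank-r LoRA adapter on a d_in x d_out projection adds r * (d_in + d_out) parameters, and that the K/V projections map 896 to 128 dimensions (2 key/value heads of size 64, taken from the public Qwen2.5-0.5B configuration rather than from the abstract).

```python
# Sanity check of the trainable-parameter count quoted in the abstract.
# Assumptions (not stated in the abstract): router bias is enabled, and the
# K/V projection width is 2 KV heads x 64 dims = 128 (grouped-query attention).

HIDDEN = 896          # hidden size, from Linear(896, 1) in the abstract
NUM_LAYERS = 24       # transformer blocks, from the abstract
LORA_RANK = 8         # from the abstract
NUM_KV_HEADS = 2      # assumed from the public model config
HEAD_DIM = 64         # assumed from the public model config
BACKBONE_PARAMS = 494e6  # from the abstract


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


kv_dim = NUM_KV_HEADS * HEAD_DIM          # 128

router_per_layer = HIDDEN + 1             # 896 weights + 1 bias = 897
lora_per_layer = (
    lora_params(HIDDEN, HIDDEN, LORA_RANK)    # Q projection
    + lora_params(HIDDEN, kv_dim, LORA_RANK)  # K projection
    + lora_params(HIDDEN, kv_dim, LORA_RANK)  # V projection
    + lora_params(HIDDEN, HIDDEN, LORA_RANK)  # O projection
)

router_total = NUM_LAYERS * router_per_layer   # 21,528
lora_total = NUM_LAYERS * lora_per_layer       # 1,081,344 (~1.08M)
trainable_total = router_total + lora_total    # 1,102,872 (~1.10M)
trainable_fraction = trainable_total / BACKBONE_PARAMS

# Skip differential in percentage points, from the reported skip rates.
skip_differential = 15.25 - 2.34

if __name__ == "__main__":
    print(f"router params:    {router_total:,}")
    print(f"LoRA params:      {lora_total:,}")
    print(f"trainable params: {trainable_total:,}")
    print(f"fraction of backbone: {100 * trainable_fraction:.2f}%")
    print(f"skip differential: {skip_differential:.2f} percentage points")
```

Under these assumptions the totals agree with the abstract's rounded figures: about 1.08M LoRA parameters, about 1.10M trainable parameters overall, and roughly 0.22% of the backbone.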