LayerRoute: Input-Conditioned Adaptive Layer Skipping for Agentic Language Models via LoRA Fine-Tuning

Abstract

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters in total) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a skip differential of 12.91 percentage points: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity deltas of -1.29 on tool calls and -1.30 on planning.
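
The parameter and skip figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code. It assumes standard bias-free LoRA factors (A of shape rank x d_in, B of shape d_out x rank) and assumes that the K and V projections of Qwen2.5-0.5B-Instruct have output width 128 (grouped-query attention with 2 KV heads of dimension 64); neither detail is stated in the abstract. The helper names `lora_params` and `router_params` are hypothetical.

```python
# Back-of-the-envelope check of the trainable-parameter budget in the abstract.
# ASSUMPTION (not stated in the source): K/V projections have output width 128
# (grouped-query attention, 2 KV heads x head dim 64), and LoRA factors carry no bias.

HIDDEN = 896              # router input width, from Linear(896,1) in the abstract
KV_DIM = 128              # assumed K/V projection width
NUM_BLOCKS = 24           # transformer blocks, from the abstract
RANK = 8                  # LoRA rank, from the abstract
BACKBONE = 494_000_000    # backbone size, from the abstract


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def router_params(d_in: int) -> int:
    """Parameters of Linear(d_in, 1): d_in weights plus one bias."""
    return d_in + 1


lora_per_block = (
    lora_params(HIDDEN, HIDDEN, RANK)    # Q projection
    + lora_params(HIDDEN, KV_DIM, RANK)  # K projection (assumed width)
    + lora_params(HIDDEN, KV_DIM, RANK)  # V projection (assumed width)
    + lora_params(HIDDEN, HIDDEN, RANK)  # O projection
)
lora_total = NUM_BLOCKS * lora_per_block
router_total = NUM_BLOCKS * router_params(HIDDEN)
trainable = lora_total + router_total
fraction = trainable / BACKBONE

# Skip differential in percentage points: tool-call skip rate minus planning skip rate.
skip_gap = round(15.25 - 2.34, 2)

print(f"router parameters per block: {router_params(HIDDEN)}")
print(f"LoRA parameters (all blocks): {lora_total:,}")
print(f"trainable parameters: {trainable:,} ({fraction:.2%} of backbone)")
print(f"skip differential: {skip_gap} percentage points")
```

Under these assumptions the LoRA adapters contribute 1,081,344 parameters and the 24 routers contribute 21,528, for 1,102,872 trainable parameters in total, which is consistent with the ~1.08M and 1.10M figures in the abstract.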