Dr.LLM: Dynamic Layer Routing in LLMs

Abstract

Large language models (LLMs) process every token through all layers of a transformer stack, which wastes computation on simple queries and leaves no flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice they often degrade accuracy despite the efficiency gains. We introduce Dr.LLM (Dynamic routing of Layers for LLMs), a retrofittable framework that equips pretrained models with lightweight per-layer routers, each deciding whether to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Three design choices (windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers) ensure robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only a 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers can retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
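The routing mechanism described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. Every name here (`windowed_mean_pool`, `BottleneckRouter`, `routed_forward`, `focal_loss`) is hypothetical, and several details are assumptions the abstract does not specify: the number of windows, the ReLU bottleneck, averaging logits across windows, treating "repeat" as two applications of a block, and the focal-loss exponent `gamma=2.0`.

```python
import math
import random

# Assumption: each action's integer value is how many times the block is applied.
SKIP, EXECUTE, REPEAT = 0, 1, 2


def windowed_mean_pool(hidden, num_windows):
    """Mean-pool token vectors over contiguous windows (at most num_windows)."""
    n = len(hidden)
    num_windows = max(1, min(num_windows, n))
    size = math.ceil(n / num_windows)
    pooled = []
    for start in range(0, n, size):
        window = hidden[start:start + size]
        dim = len(window[0])
        pooled.append([sum(tok[d] for tok in window) / len(window) for d in range(dim)])
    return pooled


class BottleneckRouter:
    """Tiny MLP: d_model -> d_bottleneck (ReLU) -> 3 logits (skip/execute/repeat)."""

    def __init__(self, d_model, d_bottleneck, rng):
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_bottleneck)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(d_bottleneck)] for _ in range(3)]

    def logits(self, vec):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]

    def decide(self, hidden_states, num_windows=4):
        # Score each pooled window, average the logits, and take the argmax.
        pooled = windowed_mean_pool(hidden_states, num_windows)
        per_window = [self.logits(v) for v in pooled]
        avg = [sum(l[c] for l in per_window) / len(per_window) for c in range(3)]
        return max(range(3), key=lambda c: avg[c])


def routed_forward(hidden_states, layers, routers):
    """Apply each frozen layer 0, 1, or 2 times according to its router."""
    executed = 0
    for layer, router in zip(layers, routers):
        action = router.decide(hidden_states)
        for _ in range(action):
            hidden_states = layer(hidden_states)
            executed += 1
    return hidden_states, executed


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss: -w_t * (1 - p_t)^gamma * log(p_t)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_t = exps[target] / sum(exps)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, n_layers = 8, 6
    tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(10)]
    # Stand-in "layers": each one just scales the hidden states.
    layers = [lambda h: [[0.9 * x for x in tok] for tok in h] for _ in range(n_layers)]
    routers = [BottleneckRouter(d_model, 4, rng) for _ in range(n_layers)]
    _, used = routed_forward(tokens, layers, routers)
    print(f"layer applications: {used} (static baseline: {n_layers})")
```

In this sketch the base layers are never modified. Only the router weights would be trained, with `focal_loss` applied against per-layer skip/execute/repeat labels such as those the abstract says are derived with MCTS. The class weights counter the imbalance that arises when one action (typically "execute") dominates the labels.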
