Dr.LLM: Dynamic Layer Routing in LLMs
October 14, 2025
Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
cs.AI
Abstract
Large Language Models (LLMs) process every token through all layers of a
transformer stack, causing wasted computation on simple queries and
insufficient flexibility for harder ones that need deeper reasoning.
Adaptive-depth methods can improve efficiency, but prior approaches rely on
costly inference-time search, architectural changes, or large-scale retraining,
and in practice often degrade accuracy despite efficiency gains. We introduce
Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that
equips pretrained models with lightweight per-layer routers deciding to skip,
execute, or repeat a block. Routers are trained with explicit supervision:
using Monte Carlo Tree Search (MCTS), we derive high-quality layer
configurations that preserve or improve accuracy under a compute budget. Our
design (windowed pooling for stable routing, focal loss with class balancing,
and bottleneck MLP routers) ensures robustness under class imbalance and long
sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to
+3.4%p while saving 5 layers per example on average. Routers generalize to
out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA,
AGIEval) with only 0.85% accuracy drop while retaining efficiency, and
outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that
explicitly supervised routers retrofit frozen LLMs for budget-aware,
accuracy-driven inference without altering base weights.
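
The abstract describes each router as a lightweight bottleneck MLP attached to a frozen layer, fed by windowed pooling over hidden states, choosing among skip, execute, or repeat, and trained with a class-balanced focal loss on MCTS-derived labels. Below is a minimal sketch of that design under stated assumptions; the class name LayerRouter, the window_size, bottleneck_dim, and class-weight values are illustrative placeholders, not the paper's actual implementation.

```python
# A minimal sketch (not the paper's code) of a per-layer router that maps
# pooled hidden states to one of three actions: skip, execute, or repeat.
import torch
import torch.nn as nn
import torch.nn.functional as F

SKIP, EXECUTE, REPEAT = 0, 1, 2  # the three routing actions named in the abstract

class LayerRouter(nn.Module):
    """Lightweight bottleneck MLP attached to one frozen transformer layer (illustrative)."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64, window_size: int = 32):
        super().__init__()
        self.window_size = window_size
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck_dim),  # down-projection (bottleneck)
            nn.GELU(),
            nn.Linear(bottleneck_dim, 3),           # logits for skip / execute / repeat
        )

    def pool(self, hidden: torch.Tensor) -> torch.Tensor:
        # Windowed pooling: average fixed-size windows along the sequence, then
        # average the window summaries, keeping long sequences numerically stable.
        b, t, d = hidden.shape
        pad = (-t) % self.window_size
        if pad:
            hidden = F.pad(hidden, (0, 0, 0, pad))
        windows = hidden.reshape(b, -1, self.window_size, d).mean(dim=2)
        return windows.mean(dim=1)  # (b, d)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.pool(hidden))  # (b, 3) action logits


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               class_weights: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Class-balanced focal loss for the imbalanced skip/execute/repeat labels."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, weight=class_weights, reduction="none")
    p_t = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_t) ** gamma * ce).mean()


# Usage: route one batch of hidden states through a single layer's router.
router = LayerRouter(hidden_dim=4096)
hidden_states = torch.randn(2, 512, 4096)            # (batch, seq_len, hidden)
actions = router(hidden_states).argmax(dim=-1)        # 0=skip, 1=execute, 2=repeat
labels = torch.tensor([EXECUTE, SKIP])                # e.g. MCTS-derived supervision
loss = focal_loss(router(hidden_states), labels,
                  class_weights=torch.tensor([1.5, 1.0, 2.0]))  # hypothetical weights
```

At inference time, each layer's router would gate its frozen block: a skip decision bypasses the block, execute runs it once, and repeat runs it again, so the base LLM weights stay untouched while per-example depth varies with the query.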