
Dr.LLM: Dynamic Layer Routing in LLMs

October 14, 2025
Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
cs.AI

Abstract

Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers that decide whether to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design (windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers) ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only a 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
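
To make the routing idea in the abstract concrete, here is a minimal sketch (not the authors' code) of a per-layer bottleneck MLP router that pools hidden states over a window and emits one of three actions (skip, execute, repeat), trained with a class-balanced focal loss against MCTS-derived labels. The hidden size, bottleneck width, window size, gamma, and class weights below are illustrative assumptions, not values from the paper.

```python
# Sketch of a per-layer router for a frozen transformer block, following the
# abstract's description: windowed pooling -> bottleneck MLP -> 3-way action.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACTIONS = ("skip", "execute", "repeat")


class LayerRouter(nn.Module):
    """Bottleneck MLP router attached to one (frozen) transformer layer."""

    def __init__(self, hidden_size: int, bottleneck: int = 64, window: int = 16):
        super().__init__()
        self.window = window
        self.net = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, len(ACTIONS)),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        # Windowed mean pooling: average the last `window` token states so the
        # routing signal stays stable on long sequences.
        pooled = hidden_states[:, -self.window:, :].mean(dim=1)
        return self.net(pooled)  # (batch, 3) logits over skip/execute/repeat


def class_balanced_focal_loss(logits, targets, class_weights, gamma: float = 2.0):
    """Focal loss with per-class weights for imbalanced routing labels."""
    log_probs = F.log_softmax(logits, dim=-1)
    pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    weights = class_weights[targets]
    return (-weights * (1.0 - pt) ** gamma * log_pt).mean()


if __name__ == "__main__":
    # Toy usage: route a batch of hidden states and train against
    # placeholder action labels standing in for MCTS-derived supervision.
    router = LayerRouter(hidden_size=4096)
    hidden = torch.randn(8, 512, 4096)
    labels = torch.randint(0, 3, (8,))
    weights = torch.tensor([1.0, 0.5, 2.0])  # illustrative class balancing
    logits = router(hidden)
    loss = class_balanced_focal_loss(logits, labels, weights)
    loss.backward()
    print([ACTIONS[i] for i in logits.argmax(dim=-1).tolist()])
```

Because each router reads only pooled hidden states and adds a small MLP per layer, it can be retrofitted onto a pretrained model without touching the base weights, which is the property the abstract emphasizes.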