LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Abstract

Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting because of two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these challenges, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online, confidence-based early exit from generation that suppresses unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
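To make the idea of a confidence-based early exit from generation more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the stopping rule, so the rule used here (stop once the mean top-token probability over a short sliding window falls below a threshold) is an assumption, as are all names (`generate_with_early_exit`, `step_fn`, `conf_threshold`, `window`, `toy_step`) and the toy confidence values.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# Hypothetical interface: step_fn takes the token ids so far and returns
# (next_token_id, confidence), where confidence is the top-token probability.
StepFn = Callable[[List[int]], Tuple[int, float]]


def generate_with_early_exit(
    step_fn: StepFn,
    prompt: List[int],
    max_new_tokens: int = 64,
    conf_threshold: float = 0.5,  # assumed value, not from the source
    window: int = 4,              # assumed value, not from the source
    eos_id: int = 0,
) -> List[int]:
    """Greedy decoding that stops early when recent confidence is low.

    Assumed rule: once `window` tokens have been generated, stop as soon as
    the mean confidence of the last `window` tokens drops below
    `conf_threshold`.
    """
    tokens = list(prompt)
    generated: List[int] = []
    recent: Deque[float] = deque(maxlen=window)  # sliding confidence window

    for _ in range(max_new_tokens):
        token, conf = step_fn(tokens)
        if token == eos_id:
            break  # normal termination on end-of-sequence
        tokens.append(token)
        generated.append(token)
        recent.append(conf)
        if len(recent) == window and sum(recent) / window < conf_threshold:
            break  # early exit: recent tokens are low-confidence
    return generated


def toy_step(tokens: List[int]) -> Tuple[int, float]:
    """Toy stand-in for a model: confidence decays as decoding proceeds."""
    confs = [0.9, 0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1]
    i = len(tokens) - 1  # tokens generated so far, given a one-token prompt
    return 100 + i, confs[min(i, len(confs) - 1)]


out = generate_with_early_exit(
    toy_step, [1], max_new_tokens=8, conf_threshold=0.5, window=2
)
print(out)
```

In this toy run, decoding stops after five tokens instead of eight because the windowed confidence falls below the threshold. In a real setting, `step_fn` would wrap a forward pass of the (layer-skipped) model, and the threshold and window would need to be tuned per stage.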
