LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Abstract

Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting because of two key challenges: (1) skip sensitivity varies from stage to stage, and (2) redundant output tokens are generated. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based early exit from generation that suppresses unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
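
To make the idea of a confidence-based early exit from generation concrete, the following is a minimal sketch, not the method as implemented in LiteStage. The abstract does not specify the exit criterion, so the rule used here (stop once the mean top-1 probability over a short window drops below a threshold) is an assumption, as are the names `generate_with_early_exit`, `next_logits`, `threshold`, and `window`.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_with_early_exit(
    next_logits: Callable[[List[int]], Sequence[float]],
    prompt: List[int],
    max_new_tokens: int = 64,
    threshold: float = 0.5,  # hypothetical confidence floor
    window: int = 3,         # hypothetical smoothing window
    eos_id: int = 0,
) -> List[int]:
    """Greedy decoding that stops once recent confidence drops too low.

    Confidence is the top-1 softmax probability, averaged over the last
    `window` generated tokens. This exit rule is an assumption made for
    illustration, not the criterion from the paper.
    """
    tokens = list(prompt)
    confidences: List[float] = []
    for _ in range(max_new_tokens):
        probs = softmax(next_logits(tokens))
        # Greedy choice: index of the most probable token.
        token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(token)
        if token == eos_id:
            break  # normal termination
        confidences.append(probs[token])
        recent = confidences[-window:]
        if len(recent) == window and sum(recent) / window < threshold:
            break  # early exit: recent tokens were low-confidence
    return tokens[len(prompt):]
```

Here `next_logits` stands in for any model call that maps the token sequence so far to next-token logits. Averaging over a window rather than reacting to a single token is a design choice of this sketch, intended to avoid exiting on one uncertain token in an otherwise confident answer.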
