River-LLM: Large Language Model Seamless Exit Based on KV Share
April 20, 2026
Authors: Yingtao Shen, An Zou
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a 1.71x to 2.16x practical speedup while maintaining high generation quality.
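To make the KV Cache Absence problem and the exit-river idea concrete, the following is a minimal toy sketch of token-level early exit. All names and functions here (`layer_fn`, `exit_river`, `confidence`, the scalar "hidden states") are illustrative stand-ins, not the paper's actual architecture or API: the point is only that when a token exits at layer i, the remaining layers' caches are still filled by a lightweight path, so later tokens find a complete KV cache without recomputation or masking.

```python
NUM_LAYERS = 4

def layer_fn(i, hidden):
    # Stand-in for a full decoder block: update the hidden state
    # and emit a (key, value) pair for this token.
    h = hidden + 0.1 * (i + 1)
    return h, h * 0.5, h * 0.5  # (new hidden, key, value)

def exit_river(hidden, j):
    # Stand-in for the lightweight KV-shared path: produce a
    # (key, value) entry for skipped layer j from the exit-time
    # hidden state (hypothetical behavior).
    return hidden * 0.5, hidden * 0.5

def confidence(prev_hidden, hidden):
    # Hypothetical stand-in for the paper's state-transition-similarity
    # criterion: a small change between consecutive layers suggests
    # the remaining layers are redundant for this token.
    return 1.0 - abs(hidden - prev_hidden)

def decode_token(hidden, kv_cache, threshold=0.85):
    for i in range(NUM_LAYERS):
        prev = hidden
        hidden, k, v = layer_fn(i, hidden)
        kv_cache[i].append((k, v))
        if i < NUM_LAYERS - 1 and confidence(prev, hidden) > threshold:
            # Early exit: instead of leaving layers i+1..N-1 with no
            # cache entry (the KV Cache Absence problem), fill them
            # via the cheap exit-river path.
            for j in range(i + 1, NUM_LAYERS):
                kv_cache[j].append(exit_river(hidden, j))
            break
    return hidden

kv_cache = [[] for _ in range(NUM_LAYERS)]
h = decode_token(0.0, kv_cache)
# Every layer's cache holds one entry for this token, so the next
# token can attend at every layer as if no layer had been skipped.
assert all(len(entries) == 1 for entries in kv_cache)
```

In this toy run the token exits after layer 0, yet the cache remains complete for all four layers, which is the property that lets layer skipping translate into wall-clock speedup rather than deferred recomputation cost.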