

River-LLM: Large Language Model Seamless Exit Based on KV Share

April 20, 2026
Authors: Yingtao Shen, An Zou
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a practical speedup of 1.71× to 2.16× while maintaining high generation quality.
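To make the KV Cache Absence problem concrete, the following is a minimal toy sketch (not the paper's code; the layer computation and all names are illustrative stand-ins) of a decoder stack where each layer writes a KV entry only if it actually runs. When a token exits early, the skipped layers leave holes in the cache that later tokens would need to attend over:

```python
# Toy illustration of the KV Cache Absence problem in token-level Early Exit.
# The "layer" computation here is a stand-in, not a real transformer block.

NUM_LAYERS = 4

def layer_forward(layer_idx, hidden, kv_cache, token_idx):
    """Toy decoder layer: transforms the hidden state and records its KV entry."""
    new_hidden = hidden + layer_idx + 1          # stand-in for attention + MLP
    kv_cache[layer_idx][token_idx] = new_hidden  # KV is written only when the layer runs
    return new_hidden

def decode_token(token_idx, hidden, kv_cache, exit_layer):
    """Run layers 0..exit_layer-1; layers above the exit point are skipped."""
    for layer_idx in range(exit_layer):
        hidden = layer_forward(layer_idx, hidden, kv_cache, token_idx)
    # Skipped layers leave holes in the cache: subsequent tokens that do run
    # these layers cannot attend to this token's missing KV entries.
    missing = [l for l in range(exit_layer, NUM_LAYERS)
               if token_idx not in kv_cache[l]]
    return hidden, missing

kv_cache = {l: {} for l in range(NUM_LAYERS)}
# Token 0 exits early after 2 of 4 layers.
_, missing = decode_token(0, hidden=0.0, kv_cache=kv_cache, exit_layer=2)
print(missing)  # layers whose KV for token 0 is absent -> [2, 3]
```

Recomputation fills these holes by re-running the skipped layers later (latency overhead), while masking simply hides the holes from attention (precision loss); River-LLM's KV-Shared Exit River instead generates the missing entries as a side effect of the exit path itself.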
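The exit decision described in the abstract relies on state transition similarity between decoder blocks. A hedged sketch of the underlying intuition: if consecutive blocks barely change the hidden state (near-identity transitions), the deeper layers are likely redundant for this token. The cosine-similarity test and the threshold below are illustrative assumptions, not the paper's actual cumulative-KV-error predictor:

```python
# Illustrative similarity-guided exit criterion. The threshold value and the
# use of raw cosine similarity are assumptions for this sketch only.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def should_exit(prev_hidden, curr_hidden, threshold=0.999):
    """Exit early when the block's state transition is nearly an identity map."""
    return cosine_similarity(prev_hidden, curr_hidden) >= threshold

print(should_exit([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # unchanged state -> True
print(should_exit([1.0, 2.0, 3.0], [3.0, -1.0, 0.5]))  # large change -> False
```

A per-token criterion of this shape is what makes the exit "precise": tokens whose representations have converged leave the stack early, while tokens still being transformed continue through the remaining layers.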