Value-aware Stochastic KV Cache Eviction for Reasoning Models

Abstract

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify two factors that are crucial to the accuracy of KV cache eviction. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failures in which models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression achieve higher average accuracy than the SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
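
To make the two ingredients concrete, the following is a minimal sketch of one way they could be combined. It is not the VaSE implementation: the abstract does not specify the importance score, the protection rule, or the sampling scheme. The helper `select_kept_tokens`, the parameters `protect_frac` and `temperature`, the use of L2 value norms, and the Gumbel top-k sampling are all assumptions made for illustration.

```python
import numpy as np


def select_kept_tokens(scores, values, budget, protect_frac=0.01,
                       temperature=1.0, rng=None):
    """Return sorted indices of cache entries to keep (hypothetical helper).

    scores: (n,) non-negative importance scores, e.g. accumulated attention
            (assumption: the abstract does not name the scoring rule).
    values: (n, d) value states of the cached tokens.
    budget: number of entries that may stay in the cache.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = scores.shape[0]
    if budget >= n:
        return np.arange(n)

    # Value-aware step (assumed rule): always keep the entries whose value
    # states have the largest L2 norm, regardless of their importance score.
    norms = np.linalg.norm(values, axis=-1)
    n_protect = min(budget, max(1, int(round(protect_frac * n))))
    protected = np.argsort(norms)[-n_protect:]

    # Stochastic step (assumed scheme): Gumbel top-k over the remaining
    # entries, which samples without replacement with probability
    # proportional to scores ** (1 / temperature).
    mask = np.ones(n, dtype=bool)
    mask[protected] = False
    rest = np.flatnonzero(mask)
    n_sample = budget - n_protect
    if n_sample > 0:
        noisy = (np.log(scores[rest] + 1e-12) / temperature
                 + rng.gumbel(size=rest.shape[0]))
        sampled = rest[np.argsort(noisy)[-n_sample:]]
    else:
        sampled = np.empty(0, dtype=int)

    return np.sort(np.concatenate([protected, sampled]))


# Usage: 64 cached tokens, 4x compression, one large-magnitude value state
# that a purely score-based policy would evict.
rng = np.random.default_rng(0)
values = rng.normal(size=(64, 8))
values[5] *= 100.0
scores = rng.random(64) + 0.1
scores[5] = 1e-6
kept = select_kept_tokens(scores, values, budget=16, rng=rng)
```

In this sketch the outlier at index 5 survives despite its near-zero score because the protection step runs before sampling. Lowering `temperature` makes the sampled part approach deterministic top-k selection, while raising it spreads the kept entries more evenly over the cache.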