TokenPilot: Cache-Efficient Context Management for LLM Agents

Abstract

As LLM agents are deployed in long-horizon sessions, accumulated context drives up inference cost. Existing approaches use text pruning or dynamic memory eviction to minimize the token footprint; however, their unconstrained sequence mutations alter the prompt layout, causing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address it, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework-level harness that stabilizes prompt prefixes and eliminates open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the residual utility of each context segment and enforces a conservative batch-turn schedule, offloading segments only once their task relevance has expired. Experiments on PinchBench and Claw-Eval show that TokenPilot reduces costs by 61% and 56% in isolated mode and by 61% and 87% in continuous mode, while maintaining performance competitive with prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
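
The trade-off between text sparsity and prompt cache continuity can be made concrete with a small sketch. It is an illustration only, not TokenPilot's implementation: it assumes a cache that reuses just the longest prefix shared with the previous prompt, and the segment names and the helper `shared_prefix_len` are hypothetical.

```python
from typing import List


def shared_prefix_len(old: List[str], new: List[str]) -> int:
    """Count the leading segments that are identical in both prompts.

    Under the assumption stated above, only this many segments of the
    new prompt can be served from the prompt cache.
    """
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


# Hypothetical prompt from the previous turn, one string per context segment.
previous = ["system", "tool_schema", "obs_1", "obs_2", "obs_3"]

# Unconstrained pruning: dropping obs_1 from the middle shifts every later
# segment, so the prompt diverges from the cached one at index 2.
pruned = ["system", "tool_schema", "obs_2", "obs_3", "obs_4"]

# Append-only growth: the cached prefix is left untouched.
appended = previous + ["obs_4"]

print(shared_prefix_len(previous, pruned))    # 2
print(shared_prefix_len(previous, appended))  # 5
```

The pruned prompt is shorter, yet only two of its segments remain reusable, whereas the append-only prompt keeps all five cached segments. Under this simplified model, deferring removals and applying them together in occasional batches would pay the invalidation cost once per batch rather than on every turn.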