TokenPilot: Cache-Efficient Context Management for LLM Agents

June 15, 2026
Authors: Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang
cs.AI

Abstract

As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches use text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter prompt layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval demonstrate that TokenPilot reduces costs by 61% and 56%, respectively, in isolated mode, and by 61% and 87%, respectively, in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
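
The trade-off between text sparsity and prompt cache continuity can be illustrated with a toy model. The sketch below is not TokenPilot's implementation. It assumes a cache that reuses work only for the longest exact shared prefix between consecutive prompts, treats the context as a list of text segments, and counts tokens by whitespace splitting. The function name `shared_prefix_tokens` and the sample segments are hypothetical.

```python
def shared_prefix_tokens(prev_prompt, curr_prompt):
    """Count tokens in the longest run of identical leading segments.

    Assumption: a prefix cache can reuse work only for this shared prefix,
    and whitespace splitting stands in for a real tokenizer.
    """
    reused = 0
    for old, new in zip(prev_prompt, curr_prompt):
        if old != new:
            break  # the first mismatch invalidates everything after it
        reused += len(old.split())
    return reused


# Hypothetical context from the previous turn.
prev = ["system prompt", "tool output A", "tool output B"]

# Append-only growth keeps the whole previous prompt as a prefix.
appended = prev + ["new user turn"]

# Pruning a middle segment shifts every later segment.
pruned = ["system prompt", "tool output B", "new user turn"]

print(shared_prefix_tokens(prev, appended))  # 8
print(shared_prefix_tokens(prev, pruned))    # 2
```

In this toy setting the pruned prompt is shorter, yet only 2 of its tokens are reusable, versus 8 for the append-only prompt. This is the kind of prefix mismatch the abstract attributes to unconstrained sequence mutation.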