TokenPilot: Cache-Efficient Context Management for LLM Agents
June 15, 2026
Authors: Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang
cs.AI
Abstract
As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches use text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter the prompt layout, causing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness that stabilizes prompt prefixes and eliminates open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the residual utility of context segments and enforces a conservative batch-turn schedule, offloading segments only once their task relevance has expired. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes show that TokenPilot reduces costs by 61% and 56%, respectively, in isolated mode and by 61% and 87%, respectively, in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
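The trade-off between text sparsity and prompt cache continuity can be made concrete with a toy model. The sketch below is not TokenPilot's implementation; it assumes a simplified prefix cache that can reuse only the leading segments shared by two consecutive prompts, and every name in it (`shared_prefix_len`, `cached_fraction`, the segment labels) is hypothetical and chosen for illustration.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cached_fraction(previous: Sequence[str], current: Sequence[str]) -> float:
    """Share of the current prompt that a prefix cache could reuse (toy model)."""
    if not current:
        return 0.0
    return shared_prefix_len(previous, current) / len(current)


# Hypothetical context: each string stands in for one prompt segment.
turn_1 = ["system", "tool_schema", "obs_1", "obs_2", "obs_3"]

# Append-only growth keeps the whole earlier prompt as a prefix.
append_only = turn_1 + ["obs_4"]

# Pruning a mid-sequence segment shifts every segment after it,
# so the prefix match stops at the point of the edit.
pruned_mid = ["system", "tool_schema", "obs_2", "obs_3", "obs_4"]

append_reuse = cached_fraction(turn_1, append_only)  # 5 of 6 segments reusable
pruned_reuse = cached_fraction(turn_1, pruned_mid)   # 2 of 5 segments reusable
```

In this toy model the pruned prompt is shorter, yet less of it is reusable from the cache, which is the tension the abstract describes. Under the same assumption, grouping evictions into occasional batches rather than applying them every turn would incur this kind of prefix break less often.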