TokenPilot: Cache-Efficient Context Management for LLM Agents

Abstract

As LLM agents are deployed in long-horizon sessions, accumulated context drives up inference cost. Existing approaches shrink the token footprint through text pruning or dynamic memory eviction, but their unconstrained sequence mutations change the prompt layout, causing prefix mismatches and cache invalidation. This exposes a critical trade-off between text sparsity and prompt cache continuity. To address it, we present TokenPilot, a dual-granularity context management framework. At the global level, Ingestion-Aware Compaction acts as a framework harness that stabilizes prompt prefixes and filters open-world environmental noise at the ingestion gate. At the local level, Lifecycle-Aware Eviction monitors the residual utility of context segments and enforces a conservative batch-turn schedule, offloading a segment only once its task relevance has expired. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes show that TokenPilot reduces cost by 61% and 56% in isolated mode and by 61% and 87% in continuous mode, while maintaining performance competitive with prior systems. TokenPilot has been integrated into LightMem2 and is available at https://github.com/zjunlp/LightMem2.
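
The trade-off between sparsity and cache continuity can be illustrated with a minimal sketch. It is not TokenPilot's implementation: it assumes a prompt can be viewed as a list of segments and that a prefix cache can reuse only the leading segments that are identical between consecutive requests. The function name `shared_prefix_len` and the segment labels are hypothetical, and real prompt caches operate on tokens rather than segments.

```python
from typing import List


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count the leading segments that are identical in both prompts.

    Under the assumption stated above, this is the portion of the
    previous request that a prefix cache could reuse.
    """
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


# Hypothetical segment-level view of the previous request.
prev_prompt = ["system", "tool_spec", "turn1", "turn2", "turn3"]

# Unconstrained pruning: turn1 is dropped as soon as it looks stale,
# which shifts every later segment and breaks the prefix at index 2.
pruned_prompt = ["system", "tool_spec", "turn2", "turn3", "turn4"]

# Deferred eviction: nothing is removed this turn, so the whole
# previous prompt is still a prefix of the new one.
deferred_prompt = prev_prompt + ["turn4"]

print(shared_prefix_len(prev_prompt, pruned_prompt))    # 2
print(shared_prefix_len(prev_prompt, deferred_prompt))  # 5
```

In this toy setting the pruned prompt is shorter but reuses only two cached segments, while the deferred prompt is longer but reuses all five. This is the tension that a conservative, batched eviction schedule is meant to balance.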