TokenPilot: Cache-Efficient Context Management for LLM Agents

Abstract

As LLM agents are deployed in long-horizon sessions, accumulated context drives up inference cost. Existing approaches use text pruning or dynamic memory eviction to reduce token footprints; however, their unconstrained sequence mutations alter the prompt layout, causing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address it, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes show that TokenPilot reduces costs by 61% and 56%, respectively, in isolated mode and by 61% and 87%, respectively, in continuous mode, while maintaining performance competitive with prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
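
The trade-off between text sparsity and prompt cache continuity can be illustrated with a toy example. The sketch below is not TokenPilot's implementation; it assumes a simplified cache model in which only the longest token prefix shared with the previous prompt is reusable, and the helper `shared_prefix_len` and the placeholder token lists are hypothetical names introduced for illustration.

```python
from typing import List


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count leading tokens that are identical in both prompts.

    Assumption: a prefix cache can reuse exactly this many tokens.
    """
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


# Toy token sequences (placeholders, not real tokenizer output).
system = ["sys"] * 4
turn1 = ["t1"] * 6
turn2 = ["t2"] * 6
turn3 = ["t3"] * 6

# Prompt sent on the previous step; assumed to be fully cached.
prev_prompt = system + turn1 + turn2

# Option A: append-only growth keeps the whole previous prompt as a prefix.
append_only = prev_prompt + turn3

# Option B: pruning inside an early turn shortens the prompt
# but shifts every later token, so the cached prefix ends early.
pruned_early = system + turn1[:2] + turn2 + turn3

append_hits = shared_prefix_len(prev_prompt, append_only)
pruned_hits = shared_prefix_len(prev_prompt, pruned_early)

# Tokens that must be recomputed because they fall outside the cached prefix.
append_uncached = len(append_only) - append_hits
pruned_uncached = len(pruned_early) - pruned_hits

print(f"append-only: {len(append_only)} tokens, {append_uncached} uncached")
print(f"pruned:      {len(pruned_early)} tokens, {pruned_uncached} uncached")
```

Under this simplified model, the pruned prompt is shorter (18 tokens versus 22) yet leaves more tokens uncached (12 versus 6), which is the kind of cache invalidation that unconstrained sequence mutation can introduce.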