Self-Compacting Language Model Agents

Abstract

Long agent traces, composed of chains of thought and tool calls, accumulate stale content that anchors subsequent generations and eventually outgrows the context window. Existing scaffolds mitigate this with fixed-interval compaction triggered at a token threshold. Such triggers ignore trajectory structure and risk discarding partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that lets the model itself decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool that the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is used unevenly across open-weight models, which often invoke it at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and by 5 to 9 points on agentic search, at 30-70% lower per-question cost. Our results also expose a meta-cognitive gap: unprompted models cannot reliably tell when their own context is rotting, but a lightweight rubric closes this gap, reframing the decision of when to compact as a capability that scaffolds can supply without training.
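To make the contrast between the two trigger styles concrete, the following is a minimal Python sketch, not the paper's implementation. Everything named here is an assumption for illustration: the `RUBRIC` wording, the `Context` class, the whitespace token estimate, the `summarize` callback standing in for a model-written summary, and the plain-string `"CALL compact()"` convention (a real scaffold would presumably use a structured tool call).

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical rubric text; illustrative wording, not the paper's prompt.
RUBRIC = (
    "Call compact() when a sub-task has resolved or the trajectory is "
    "converging. Do not call compact() mid-derivation or when stuck."
)


@dataclass
class Context:
    system: str  # the rubric would be supplied here, alongside the tool
    messages: List[str] = field(default_factory=list)

    def token_count(self) -> int:
        # Crude whitespace-based estimate, for illustration only.
        return sum(len(m.split()) for m in self.messages)


def compact(ctx: Context, summarize: Callable[[List[str]], str]) -> None:
    """Compaction tool: replace the accumulated messages with one summary."""
    if not ctx.messages:
        return
    ctx.messages = [f"[summary] {summarize(ctx.messages)}"]


def fixed_interval_should_compact(ctx: Context, threshold: int) -> bool:
    """Baseline trigger: fires on size alone, blind to trajectory structure."""
    return ctx.token_count() >= threshold


def run_step(
    ctx: Context, model_output: str, summarize: Callable[[List[str]], str]
) -> bool:
    """Record one model output; return True if the model chose to compact."""
    if model_output.strip() == "CALL compact()":  # hypothetical convention
        compact(ctx, summarize)
        return True
    ctx.messages.append(model_output)
    return False
```

In this sketch the baseline would compact as soon as the size threshold is crossed, whereas the model-driven path compacts only when the model, guided by the rubric in its system prompt, emits the tool call. The summarizer is passed in as a callback so the control flow can be exercised without a model.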