自压缩语言模型智能体

摘要

由思维链和工具调用组成的长智能体轨迹会积累陈旧内容，这些内容会锚定后续生成，并最终超出上下文窗口。现有框架通过基于令牌阈值的固定间隔压缩来缓解这一问题。然而，此类触发机制未考虑轨迹结构，可能导致推导或搜索过程中部分结果被中途丢弃。我们提出SelfCompact——一种允许模型自主决定何时以及如何压缩的框架。具体而言，它结合了两个推理阶段要素：（i）模型调用的压缩工具，用于总结累积的上下文；（ii）一个轻量级规则清单，规定何时触发压缩（子任务已解决或轨迹趋于收敛）以及何时抑制压缩（推导中途或陷入停滞）。两者缺一不可：仅靠工具时，开源模型的使用方式参差不齐，常在不合适的时机调用或根本不调用；仅靠规则清单则无法执行。两者结合后，无需任何微调或外部监督即可实现有效的自适应压缩。我们在六个基准测试（竞赛数学和智能体搜索）和七个模型上进行了实验。结果表明，SelfCompact以极低的令牌成本达到或超越了固定间隔总结的效果，在数学任务上相比无总结基线提升高达18.1个百分点，在智能体搜索上提升5-9个百分点，同时每个问题的成本降低30-70%。研究结果揭示了一个元认知差距：尽管未经提示的模型无法可靠判断自身上下文何时开始"腐烂"，但轻量级规则清单消除了这一差距，将"何时压缩"重新定义为框架无需训练即可提供的能力。

English

Long agent traces composed of chains of thought and tool calls accumulate stale content that anchor subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate it with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.