Self-Compacting Language Model Agents

Abstract

Long agent traces composed of chains of thought and tool calls accumulate stale content that anchors subsequent generations, and they eventually outgrow the context window. Existing scaffolds mitigate this with fixed-interval compaction triggered at a token threshold. Such triggers ignore trajectory structure and risk discarding partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that lets the model itself decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool that the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is used unevenly across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competition math and agentic search) and seven models. SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5 to 9 points on agentic search at 30 to 70% lower per-question cost. These results expose a meta-cognitive gap: unprompted models cannot reliably tell when their own context is rotting, but a lightweight rubric closes this gap, reframing the decision of when to compact as a capability that scaffolds can supply without training.
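
To make the contrast between the two trigger styles concrete, the following is a minimal sketch, not the authors' implementation. It assumes a whitespace word count as a stand-in for tokens, a caller-supplied `summarize` function, and a hypothetical marker string `<call:compact>` in place of a real tool-calling API. The names `RUBRIC`, `Context`, `compact`, `fixed_interval_trigger`, and `model_driven_trigger` are illustrative and do not come from the paper; the rubric text only paraphrases the fire and suppress conditions stated above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical rubric text paraphrasing the fire/suppress conditions in the abstract.
RUBRIC = (
    "Call the `compact` tool when a sub-task has resolved or the trajectory "
    "is converging. Do not call it mid-derivation or when you are stuck."
)


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


@dataclass
class Context:
    messages: List[Message] = field(default_factory=list)

    def token_count(self) -> int:
        # Assumption: whitespace-separated words approximate tokens.
        return sum(len(m.content.split()) for m in self.messages)


def compact(ctx: Context,
            summarize: Callable[[List[Message]], str],
            keep_head: int = 2) -> Context:
    """Replace everything after the first `keep_head` messages with one summary."""
    head, tail = ctx.messages[:keep_head], ctx.messages[keep_head:]
    if not tail:
        return ctx  # nothing to compact
    summary = Message("tool", "[summary] " + summarize(tail))
    return Context(head + [summary])


def fixed_interval_trigger(ctx: Context, threshold: int) -> bool:
    """Baseline: fire at a token threshold, regardless of trajectory structure."""
    return ctx.token_count() >= threshold


def model_driven_trigger(assistant_turn: str) -> bool:
    """SelfCompact-style: the model requests compaction itself.

    Assumption: the request is signalled by a marker string; a real scaffold
    would inspect a structured tool call instead.
    """
    return "<call:compact>" in assistant_turn


def toy_summarize(msgs: List[Message]) -> str:
    # Placeholder summarizer; in practice the model would write the summary.
    return f"{len(msgs)} earlier steps condensed"


ctx = Context([
    Message("system", RUBRIC),
    Message("user", "Solve the problem."),
    Message("assistant", "step one of the derivation"),
    Message("assistant", "step two, sub-task resolved <call:compact>"),
])

# The model's last turn asks for compaction, so the scaffold applies it.
if model_driven_trigger(ctx.messages[-1].content):
    ctx = compact(ctx, toy_summarize)
```

In this sketch the system prompt and the task statement are kept verbatim and only the intermediate steps are summarized; which messages to preserve is a design choice of the sketch, not something the abstract specifies.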