Plans Do Not Persist: Why Context Management Is Essential for LLM Agents

Abstract

Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so that tasks can continue beyond a finite window. This is safe only when the dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and are the first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures the cosine distance between the resulting hidden states. On Llama-3.1-70B, the plan signal spikes to 0.453 one step after the plan, then falls 4.1x within a single action-observation step; on HotpotQA it falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state and instead depend on the plan remaining in context. A probe at layer L32 detects this decay; we use it as a diagnostic, not as proof that the probe reads plan content itself. Reasoning models add a measurement confound: their `<think>` traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior `<think>` blocks from the stripped run only. Strict stripping recovers +163% of the step+1 signal in-sample and +153% held out, while leaving non-reasoning Llama essentially unchanged (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach 1.000, suggesting that R1 encodes the plan signal along a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, and probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load-bearing, but plan protection alone is not enough.
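
As an illustration of the two measurement steps described above, the following is a minimal sketch, not the implementation used in the experiments. It assumes hidden states have already been extracted as one vector per step for each run, and that reasoning traces are delimited by literal `<think>` and `</think>` tags. The function names (`strip_think_blocks`, `cosine_distance`, `plan_signal`) are hypothetical and chosen for this example only.

```python
import math
import re

# Assumption: reasoning traces appear as literal <think>...</think> spans.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_blocks(history_text):
    """Remove every <think>...</think> span from prior-turn text.

    In strict stripping this would be applied to the history of the
    plan-stripped run only; the with-plan run is left untouched.
    """
    return THINK_BLOCK.sub("", history_text)


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def plan_signal(with_plan, without_plan):
    """Per-step cosine distance between paired hidden states.

    with_plan, without_plan: sequences of per-step hidden-state vectors
    taken from the same trajectory, replayed with and without the plan.
    """
    if len(with_plan) != len(without_plan):
        raise ValueError("paired runs must cover the same number of steps")
    return [cosine_distance(h, g) for h, g in zip(with_plan, without_plan)]
```

Under these assumptions, a decaying plan signal would show up as a list from `plan_signal` whose entries shrink toward zero as the step index moves away from the plan step.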