Plans Do Not Persist: Why Context Management Is Critical for LLM Agents

Abstract

Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so that tasks can continue beyond a finite context window. This is safe only when the dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and are the first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures the cosine distance between the resulting hidden states. On Llama-3.1-70B, the plan signal peaks at 0.453 one step after the plan is written, then falls 4.1x after a single action-observation step; on HotpotQA it falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state and instead depend on the plan remaining in context. A layer-32 (L32) probe detects this decay, which we use as a diagnostic, not as proof that the probe reads plan content itself. Reasoning models add a measurement confound: their `<think>` traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior `<think>` blocks from the stripped run only. Strict stripping recovers the step+1 signal (+163% in-sample, +153% held out) while not meaningfully changing non-reasoning Llama (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach AUROC 1.000, suggesting that R1 encodes plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, and probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load-bearing, but plan protection alone is not enough.
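As an illustration of the two measurement ideas above, the following is a minimal sketch, not the implementation used in this work. It assumes hidden states have already been extracted as plain lists of floats, one vector per step, and that the agent history is a list of strings with the plan at a known index. The names `cosine_distance`, `plan_signal`, and `strict_strip` are hypothetical and chosen only for this example.

```python
import math
import re
from typing import List, Sequence

# Matches one <think>...</think> block, including blocks that span newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def plan_signal(
    kept: Sequence[Sequence[float]],
    stripped: Sequence[Sequence[float]],
) -> List[float]:
    """Per-step distance between the plan-kept and plan-stripped replays.

    Assumption: kept[i] and stripped[i] are hidden states taken at the same
    step of the same trajectory, differing only in whether the plan was in
    the history.
    """
    if len(kept) != len(stripped):
        raise ValueError("both replays must cover the same steps")
    return [cosine_distance(k, s) for k, s in zip(kept, stripped)]


def strict_strip(history: List[str], plan_index: int) -> List[str]:
    """Build the stripped-run history under strict stripping.

    Assumption: the plan is a single turn at plan_index. That turn is dropped,
    and <think> blocks are removed from every remaining turn so the trace
    cannot carry re-derived plan content into the stripped condition.
    """
    stripped = []
    for i, turn in enumerate(history):
        if i == plan_index:
            continue
        stripped.append(THINK_BLOCK.sub("", turn).strip())
    return stripped


if __name__ == "__main__":
    history = [
        "PLAN: find the mug, then heat it",
        "<think>the plan says to find the mug first</think>go to kitchen",
        "obs: you see a mug",
    ]
    print(strict_strip(history, plan_index=0))
    print(plan_signal([[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 1.0]]))
```

In this sketch, only the stripped-run history passes through `strict_strip`; the plan-kept run is replayed unchanged, which matches the description of strict stripping as applying to the stripped run only.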