Plans Do Not Persist: Why Context Management Is Load-Bearing for LLM Agents

Abstract

Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so that tasks can continue beyond a finite context window. This is safe only when the dropped information is no longer needed or has been internalized by the model. Plans are the stress case: they are written early, used for many steps, and are the first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in the history and measures the cosine distance between the resulting hidden states. On Llama-3.1-70B, the plan signal spikes to 0.453 one step after the plan and then falls 4.1x within a single action-observation step; on HotpotQA it falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state and instead depend on the plan remaining in context. A probe on layer L32 detects this decay; we treat it as a diagnostic, not as proof that the probe reads plan content itself. Reasoning models add a measurement confound: their `<think>` traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior `<think>` blocks from the stripped run only. Strict stripping recovers +163% of the step+1 signal in-sample and +153% held out, while leaving non-reasoning Llama essentially unchanged (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach 1.000, suggesting that R1 encodes the plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, and probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load-bearing, but plan protection alone is not enough.
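The two measurement ingredients named above, strict stripping and hidden-state cosine distance, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`strip_think`, `strict_strip`, `cosine_distance`) are hypothetical, histories are modeled as plain lists of message strings, and hidden states are stand-in lists of floats. In a real replay-pairing run the two vectors would presumably be read from the model at a matched position in the with-plan and stripped runs.

```python
import math
import re

# Matches one complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove every <think>...</think> block from a single message."""
    return THINK_BLOCK.sub("", text)


def strict_strip(history: list[str], plan_index: int) -> list[str]:
    """Build the stripped-run history (assumed format: one string per message).

    Drops the plan message at plan_index and also removes <think> blocks
    from the remaining messages, so re-derived plan content cannot leak
    into the stripped condition.
    """
    return [strip_think(msg) for i, msg in enumerate(history) if i != plan_index]


def cosine_distance(u: list[float], v: list[float]) -> float:
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

In this sketch, the with-plan run would use `history` unchanged and the stripped run would use `strict_strip(history, plan_index)`. The plan signal at a given step is then `cosine_distance(h_with, h_stripped)` for the two hidden-state vectors at that step. Applying `strip_think` only to the stripped run mirrors the description of strict stripping above.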