Plans Do Not Persist: Why Context Management Matters for LLM Agents

Abstract

Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so that tasks can continue beyond a finite context window. This is safe only when the dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and are the first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures the hidden-state cosine distance between the two runs. On Llama-3.1-70B, the plan signal spikes to 0.453 one step after the plan, then falls 4.1x within a single action-observation step; on HotpotQA it falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state and instead depend on the plan remaining in context. A probe at layer 32 (L32) detects this decay; we use it as a diagnostic, not as proof that the probe reads plan content itself. Reasoning models add a measurement confound: their `<think>` traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior `<think>` blocks from the stripped run only. Strict stripping restores the step+1 signal (+163% in-sample, +153% held out) while leaving non-reasoning Llama essentially unchanged (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p = 6e-4), while R1-specific probes reach 1.000, suggesting that R1 encodes plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7 percentage points, and probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load-bearing, but plan protection alone is not enough.
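
The two operations behind replay pairing can be illustrated with a minimal sketch; it is not the implementation used in this work. It assumes that the history is a list of `{"role", "content"}` messages, that reasoning traces are delimited by literal `<think>...</think>` tags, and that hidden states are available as plain float vectors. The names `strip_plan`, `cosine_distance`, and `THINK_BLOCK` are hypothetical.

```python
import math
import re
from typing import Dict, List, Sequence

# Assumption: reasoning traces appear as literal <think>...</think> spans.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(a, b) for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def strip_plan(
    history: List[Dict[str, str]], plan_index: int, strict: bool = False
) -> List[Dict[str, str]]:
    """Build the stripped-condition history by dropping the plan message.

    With strict=True, <think> blocks are also removed from the remaining
    assistant turns, so the plan cannot leak back in through re-derived
    reasoning. The input history is not modified.
    """
    stripped = []
    for i, msg in enumerate(history):
        if i == plan_index:
            continue  # drop the plan message itself
        content = msg["content"]
        if strict and msg["role"] == "assistant":
            content = THINK_BLOCK.sub("", content)
        stripped.append({"role": msg["role"], "content": content})
    return stripped
```

In a paired replay under these assumptions, the model would be run once on the full history and once on the output of `strip_plan`, and `cosine_distance` would be applied to the hidden states taken at the same step in each run. Only the stripped run is altered, matching the description of strict stripping above.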