Masking Stale Observations Helps Search Agents, but Not Always: A Regime Map and Its Mechanisms

Abstract

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live-web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model's accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model's implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (https://github.com/i-DeepSearch/observation-masking) to support future research.
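
To make the intervention concrete, the following is a minimal sketch of observation masking, not the released scaffold's implementation. It assumes a chat-style trajectory in which tool observations are messages with role `"tool"`. The function name `mask_stale_observations`, the `keep_last` parameter, and the placeholder string are hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical placeholder; the paper's scaffold may use different text.
PLACEHOLDER = "[observation masked]"


def mask_stale_observations(
    messages: List[Dict[str, str]], keep_last: int = 2
) -> List[Dict[str, str]]:
    """Return a copy of the trajectory with all but the most recent
    `keep_last` tool observations replaced by a short placeholder.

    Assumption: observations are messages whose role is "tool".
    Non-tool messages (system, user, assistant) are left unchanged.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    if keep_last > 0:
        stale = set(tool_indices[:-keep_last])
    else:
        stale = set(tool_indices)
    return [
        {**m, "content": PLACEHOLDER} if i in stale else dict(m)
        for i, m in enumerate(messages)
    ]


def total_chars(messages: List[Dict[str, str]]) -> int:
    """Crude stand-in for a token count: total characters of content."""
    return sum(len(m["content"]) for m in messages)
```

Applying the function before each model call shrinks the context that older observations occupy, which is the token side of the token-for-turn trade-off described above. Whether the freed budget helps depends on the regime: if the masked pages held evidence the model would otherwise have used, the same operation removes it.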