LiveBrowseComp: Do Search Agents Search, or Merely Verify What They Already Know?

Abstract

Are LLM-based search agents genuinely searching, or merely using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge (information encoded in the model before retrieval) rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than from retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25 to 40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.
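
The 90-day recency criterion can be illustrated with a minimal sketch. This is not the authors' construction pipeline: the `Candidate` record, its field names, the inclusive window boundary, and the example build date are all assumptions made for illustration, and the released dataset may be organized differently.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record layout (assumption); not the released dataset schema.
    question: str
    fact_published: date  # publication date of the fact the answer depends on


def within_window(fact_published: date, built_on: date, window_days: int = 90) -> bool:
    """Return True if the fact appeared in the `window_days` preceding `built_on`.

    The window is treated as inclusive on both ends (an assumption).
    """
    age = built_on - fact_published
    return timedelta(0) <= age <= timedelta(days=window_days)


def filter_fresh(candidates, built_on: date, window_days: int = 90):
    """Keep only candidates whose supporting fact falls inside the window."""
    return [c for c in candidates
            if within_window(c.fact_published, built_on, window_days)]


if __name__ == "__main__":
    built_on = date(2025, 6, 1)  # hypothetical construction date
    pool = [
        Candidate("recent fact", date(2025, 5, 20)),
        Candidate("stale fact", date(2024, 11, 1)),
    ]
    print([c.question for c in filter_fresh(pool, built_on)])
```

A date-based filter like this only addresses recency; the exclusion of globally salient events described above would need a separate step that this sketch does not attempt.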