VibeSearchBench: Benchmarking Long-Horizon Proactive Search in Real-World Settings

Abstract

LLM-based agents score well on search benchmarks, yet real users consistently find their results unsatisfying, revealing a persistent gap between evaluation and experience. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflects real search behavior, in which users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark of 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph and is evaluated with a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
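
To make the idea of scoring an agent's output against a ground-truth knowledge graph concrete, the following is a minimal sketch of a triple-level F1. It is an illustration only, not the benchmark's actual graph-matching framework: the abstract does not specify the matching procedure, so the exact-match rule, the `normalize` helper, the `triple_f1` function, and the example triples are all assumptions introduced here. A schema-free graph would likely need softer matching (for example, semantic or LLM-judged alignment of nodes and relations) than the string equality used below.

```python
from typing import Iterable, Tuple

# A knowledge-graph edge as (subject, relation, object). Hypothetical representation.
Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each element so trivial formatting differences do not block a match."""
    subject, relation, obj = triple
    return (subject.strip().lower(), relation.strip().lower(), obj.strip().lower())


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact matching on normalized triples.

    Assumption: exact string match stands in for whatever matching the
    benchmark's graph-matching evaluation actually performs.
    """
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    if not pred_set or not gold_set:
        return 0.0, 0.0, 0.0

    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0, 0.0, 0.0

    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data, purely for illustration.
gold_graph = [
    ("Laptop A", "has_weight", "1.2 kg"),
    ("Laptop A", "has_battery_life", "18 h"),
    ("Laptop A", "priced_at", "$999"),
]
predicted_graph = [
    ("laptop a", "has_weight", "1.2 kg"),   # matches after normalization
    ("Laptop A", "has_screen", "14 inch"),  # not in the gold graph
]

precision, recall, f1 = triple_f1(predicted_graph, gold_graph)
```

In this toy case one of the two predicted triples is correct and one of the three gold triples is recovered, so precision is 1/2, recall is 1/3, and F1 is 0.4. Penalizing both spurious and missing edges in this way is one reason an F1-style metric suits open-ended graph construction, where an agent can err by over-collecting as well as by under-collecting.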