VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
May 27, 2026
Authors: Xiaohongshu Inc
cs.AI
Abstract
LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
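To make the reported F1 metric concrete, the following is a minimal sketch of how a predicted knowledge graph might be scored against a ground-truth graph. It assumes graphs are represented as sets of (subject, relation, object) triples and that two triples match only when they are identical after case and whitespace normalization. Both the representation and the exact-match rule are illustrative assumptions; the abstract does not specify the matching procedure of the graph-matching evaluation framework, and the names `normalize` and `graph_f1` are hypothetical.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase each field and collapse whitespace (assumed matching rule)."""
    return tuple(" ".join(part.lower().split()) for part in triple)


def graph_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> dict:
    """Triple-level precision, recall, and F1 under exact match.

    Illustrative only: a schema-free evaluation would likely need softer
    matching than exact string equality.
    """
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    matched = len(pred_set & gold_set)

    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with made-up triples (not drawn from the benchmark).
gold = [
    ("Cafe A", "located_in", "Shanghai"),
    ("Cafe A", "serves", "pour-over coffee"),
    ("Cafe A", "price_range", "mid"),
    ("Cafe A", "open_until", "22:00"),
]
predicted = [
    ("cafe a", "located_in", "shanghai"),   # matches after normalization
    ("Cafe A", "serves", "espresso"),       # no match
]
scores = graph_f1(predicted, gold)
print(scores)
```

In this toy case one of two predicted triples is correct and one of four gold triples is recovered, so precision is 0.5 and recall is 0.25. A score of this kind penalizes both missing facts and unsupported ones, which is why incomplete intent elicitation would depress recall even when every returned fact is accurate.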