VibeSearchBench: A Benchmark for Long-Horizon Proactive Search in the Wild

Abstract

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.