VibeSearchBench: A Benchmark for Long-Horizon Proactive Search in Real-World Settings

Abstract

LLM-based agents score well on search benchmarks, yet real users consistently find their results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflects real search behavior, in which users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark of 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
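
To make the F1 figure concrete, the following is a minimal sketch of how an F1 score over knowledge-graph triples could be computed. It is not the paper's graph-matching framework, which the abstract does not specify. The `Triple` representation, the `normalize` helper, and the exact-match criterion are all assumptions made for illustration; a schema-free evaluation would likely need softer matching of entities and relations.

```python
from typing import Iterable, Tuple

# Hypothetical representation: a graph is a collection of
# (subject, relation, object) string triples.
Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase and collapse whitespace in each element of a triple."""
    return tuple(" ".join(part.lower().split()) for part in triple)


def graph_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """F1 between two triple sets under normalized exact match (an assumption)."""
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    matched = len(pred_set & gold_set)
    if matched == 0:
        # Also covers the cases where either set is empty.
        return 0.0
    precision = matched / len(pred_set)
    recall = matched / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data invented for illustration only.
    gold = [
        ("Laptop A", "has_price", "900 USD"),
        ("Laptop A", "has_weight", "1.2 kg"),
        ("Laptop B", "has_price", "1100 USD"),
        ("Laptop B", "has_weight", "1.5 kg"),
    ]
    predicted = [
        ("laptop a", "has_price", "900 usd"),  # matches after normalization
        ("Laptop C", "has_price", "700 USD"),  # not in the gold graph
    ]
    print(round(graph_f1(predicted, gold), 4))
```

In this toy case the agent recovers one of four gold triples and one of its two predictions is correct, so precision is 0.5 and recall is 0.25. Under this simplified metric, a low F1 can come from either missing gold facts or adding unsupported ones.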