RAVine: Reality-Aligned Evaluation for Agentic Search

Abstract

Agentic search, a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine, a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines the model's interaction with search tools throughout the iterative process and accounts for efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.