RAVine: Reality-Aligned Evaluation for Agentic Search

Abstract

Agentic search, a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluation, which distorts assessment at a fine-grained level. Third, most current frameworks focus solely on the quality of the final answer and neglect to evaluate the iterative process inherent to agentic search. To address these limitations, we propose RAVine, a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intent, and introduces an attributable ground truth construction strategy to improve the accuracy of fine-grained evaluation. Moreover, RAVine examines the model's interaction with search tools throughout the iterative process and accounts for efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will help advance the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.
