Search Arena: Analyzing Search-Augmented LLMs
June 5, 2025
Authors: Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
cs.AI
Abstract
Search-augmented language models combine web search with Large Language
Models (LLMs) to improve response groundedness and freshness. However,
analyzing these systems remains challenging: existing datasets are limited in
scale and narrow in scope, often constrained to static, single-turn,
fact-checking questions. In this work, we introduce Search Arena, a
crowd-sourced, large-scale, human-preference dataset of over 24,000 paired
multi-turn user interactions with search-augmented LLMs. The dataset spans
diverse intents and languages, and contains full system traces with around
12,000 human preference votes. Our analysis reveals that user preferences are
influenced by the number of citations, even when the cited content does not
directly support the attributed claims, uncovering a gap between perceived and
actual credibility. Furthermore, user preferences vary across cited sources,
revealing that community-driven platforms are generally preferred and static
encyclopedic sources are not always appropriate and reliable. To assess
performance across different settings, we conduct cross-arena analyses by
testing search-augmented LLMs in a general-purpose chat environment and
conventional LLMs in search-intensive settings. We find that web search does
not degrade and may even improve performance in non-search settings; however,
the quality in search settings is significantly affected if solely relying on
the model's parametric knowledge. We open-sourced the dataset to support future
research in this direction. Our dataset and code are available at:
https://github.com/lmarena/search-arena.
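The abstract mentions roughly 12,000 pairwise human preference votes. A minimal sketch of how such votes can be aggregated into per-model win rates is shown below; the record schema (`model_a`, `model_b`, `winner`) and the model names are illustrative assumptions, not the dataset's actual format, so consult the repository above for the real schema.

```python
# Hypothetical sketch: aggregating pairwise preference votes (as in a
# chatbot-arena-style dataset) into per-model win rates.
# Field names and model names below are assumptions for illustration only.
from collections import defaultdict

votes = [
    {"model_a": "search-llm-1", "model_b": "search-llm-2", "winner": "model_a"},
    {"model_a": "search-llm-1", "model_b": "search-llm-3", "winner": "model_b"},
    {"model_a": "search-llm-2", "model_b": "search-llm-3", "winner": "tie"},
]

def win_rates(votes):
    """Return each model's win rate, counting a tie as half a win."""
    wins, games = defaultdict(float), defaultdict(int)
    for v in votes:
        a, b = v["model_a"], v["model_b"]
        games[a] += 1
        games[b] += 1
        if v["winner"] == "model_a":
            wins[a] += 1
        elif v["winner"] == "model_b":
            wins[b] += 1
        else:  # tie: split credit between both sides
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / games[m] for m in games}

print(win_rates(votes))
# e.g. {'search-llm-1': 0.5, 'search-llm-2': 0.25, 'search-llm-3': 0.75}
```

In practice, arena-style leaderboards typically fit a Bradley-Terry or Elo-style model rather than raw win rates, since raw rates are sensitive to which opponents each model happened to face.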