Search Arena: Analyzing Search-Augmented LLMs

Abstract

Search-augmented language models combine web search with large language models (LLMs) to improve the groundedness and freshness of responses. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources: community-driven platforms are generally preferred, while static encyclopedic sources are not always appropriate or reliable. To assess performance across settings, we conduct cross-arena analyses, testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade, and may even improve, performance in non-search settings; however, quality in search settings suffers significantly when a model relies solely on its parametric knowledge. We open-source the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.
