Search Arena: Analyzing Search-Augmented LLMs

Abstract

Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces along with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources: community-driven platforms are generally preferred, while static encyclopedic sources are not always appropriate or reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade, and may even improve, performance in non-search settings; in search settings, however, quality is significantly affected when the model relies solely on its parametric knowledge. We have open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.
