UQ: Assessing Language Models on Unsolved Questions

Abstract

Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
