UQ: Assessing Language Models on Unsolved Questions

Abstract

Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.

