UQ: Assessing Language Models on Unsolved Questions
August 25, 2025
作者: Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff
cs.AI
Abstract
Benchmarks shape progress in AI research. A useful benchmark should be both
difficult and realistic: questions should challenge frontier models while also
reflecting real-world usage. Yet, current paradigms face a difficulty-realism
tension: exam-style benchmarks are often made artificially difficult with
limited real-world value, while benchmarks based on real user interaction often
skew toward easy, high-frequency problems. In this work, we explore a radically
different paradigm: assessing models on unsolved questions. Rather than a
static benchmark scored once, we curate unsolved questions and evaluate models
asynchronously over time with validator-assisted screening and community
verification. We introduce UQ, a testbed of 500 challenging, diverse questions
sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi
and history, probing capabilities including reasoning, factuality, and
browsing. UQ is difficult and realistic by construction: unsolved questions are
often hard and naturally arise when humans seek answers, thus solving them
yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset
and its collection pipeline combining rule-based filters, LLM judges, and human
review to ensure question quality (e.g., well-defined and difficult); (2)
UQ-Validators, compound validation strategies that leverage the
generator-validator gap to provide evaluation signals and pre-screen candidate
solutions for human review; and (3) UQ-Platform, an open platform where experts
collectively verify questions and solutions. The top model passes UQ-validation
on only 15% of questions, and preliminary human verification has already
identified correct answers among those that passed. UQ charts a path for
evaluating frontier models on real-world, open-ended challenges, where success
pushes the frontier of human knowledge. We release UQ at
https://uq.stanford.edu.
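
The abstract describes UQ-Validators as compound validation strategies that exploit the generator-validator gap to pre-screen candidate solutions before human review. As a rough illustration only, the sketch below shows one way such a pre-screening stage could be wired up: a cheap rule-based filter followed by one or more LLM judges that must all pass. The function names (`query_llm`, `compound_validate`), prompts, and pass criteria are assumptions for illustration, not the paper's actual UQ-Validator implementation.

```python
# Hypothetical sketch of validator-assisted pre-screening; names, prompts,
# and pass criteria are illustrative, not the paper's actual pipeline.
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    answer: str


def query_llm(model: str, prompt: str) -> str:
    # Placeholder: wire this to whatever inference client is available
    # (e.g. an OpenAI-compatible API). Returns the model's raw text reply.
    raise NotImplementedError


def rule_based_filter(c: Candidate) -> bool:
    # Cheap checks before spending any LLM calls: non-empty, not a refusal.
    text = c.answer.strip().lower()
    return len(text) > 0 and "i don't know" not in text


def llm_validator(c: Candidate, model: str) -> bool:
    # Ask a (stronger or differently-prompted) model to judge the candidate.
    prompt = (
        "You are verifying a proposed answer to an unsolved question.\n"
        f"Question: {c.question}\n"
        f"Proposed answer: {c.answer}\n"
        "Reply PASS only if the answer is well-supported and complete."
    )
    return "PASS" in query_llm(model, prompt)


def compound_validate(c: Candidate, validator_models: list[str]) -> bool:
    # Compound strategy: every stage must pass before the candidate is
    # surfaced for human review on the platform.
    if not rule_based_filter(c):
        return False
    return all(llm_validator(c, m) for m in validator_models)
```

In this hypothetical setup, only candidates that clear every validator would be queued for expert verification, which matches the abstract's framing of validators as an evaluation signal and a pre-screening step rather than a final judge.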