UQ: Assessing Language Models on Unsolved Questions
August 25, 2025
作者: Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff
cs.AI
Abstract
Benchmarks shape progress in AI research. A useful benchmark should be both
difficult and realistic: questions should challenge frontier models while also
reflecting real-world usage. Yet, current paradigms face a difficulty-realism
tension: exam-style benchmarks are often made artificially difficult with
limited real-world value, while benchmarks based on real user interaction often
skew toward easy, high-frequency problems. In this work, we explore a radically
different paradigm: assessing models on unsolved questions. Rather than a
static benchmark scored once, we curate unsolved questions and evaluate models
asynchronously over time with validator-assisted screening and community
verification. We introduce UQ, a testbed of 500 challenging, diverse questions
sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi
and history, probing capabilities including reasoning, factuality, and
browsing. UQ is difficult and realistic by construction: unsolved questions are
often hard and naturally arise when humans seek answers, thus solving them
yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset
and its collection pipeline combining rule-based filters, LLM judges, and human
review to ensure question quality (e.g., well-defined and difficult); (2)
UQ-Validators, compound validation strategies that leverage the
generator-validator gap to provide evaluation signals and pre-screen candidate
solutions for human review; and (3) UQ-Platform, an open platform where experts
collectively verify questions and solutions. The top model passes UQ-validation
on only 15% of questions, and preliminary human verification has already
identified correct answers among those that passed. UQ charts a path for
evaluating frontier models on real-world, open-ended challenges, where success
pushes the frontier of human knowledge. We release UQ at
https://uq.stanford.edu.
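
The abstract describes UQ-Validators as compound validation strategies that exploit the generator-validator gap to pre-screen candidate solutions before human review. As a rough illustration only, the sketch below shows one way such a pre-screening stage could be wired up: a cheap rule-based filter followed by one or more LLM judges that must all pass. The function names (`query_llm`, `compound_validate`), prompts, and pass criteria are assumptions for illustration, not the paper's actual UQ-Validator implementation.

```python
# Hypothetical sketch of validator-assisted pre-screening; names, prompts,
# and pass criteria are illustrative, not the paper's actual pipeline.
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    answer: str


def query_llm(model: str, prompt: str) -> str:
    # Placeholder: wire this to whatever inference client is available
    # (e.g. an OpenAI-compatible API). Returns the model's raw text reply.
    raise NotImplementedError


def rule_based_filter(c: Candidate) -> bool:
    # Cheap checks before spending any LLM calls: non-empty, not a refusal.
    text = c.answer.strip().lower()
    return len(text) > 0 and "i don't know" not in text


def llm_validator(c: Candidate, model: str) -> bool:
    # Ask a (stronger or differently-prompted) model to judge the candidate.
    prompt = (
        "You are verifying a proposed answer to an unsolved question.\n"
        f"Question: {c.question}\n"
        f"Proposed answer: {c.answer}\n"
        "Reply PASS only if the answer is well-supported and complete."
    )
    return "PASS" in query_llm(model, prompt)


def compound_validate(c: Candidate, validator_models: list[str]) -> bool:
    # Compound strategy: every stage must pass before the candidate is
    # surfaced for human review on the platform.
    if not rule_based_filter(c):
        return False
    return all(llm_validator(c, m) for m in validator_models)
```

In this hypothetical setup, only candidates that clear every validator would be queued for expert verification, which matches the abstract's framing of validators as an evaluation signal and a pre-screening step rather than a final judge.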