UQ: Assessing Language Models on Unsolved Questions
August 25, 2025
Authors: Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff
cs.AI
Abstract
Benchmarks shape progress in AI research. A useful benchmark should be both
difficult and realistic: questions should challenge frontier models while also
reflecting real-world usage. Yet, current paradigms face a difficulty-realism
tension: exam-style benchmarks are often made artificially difficult with
limited real-world value, while benchmarks based on real user interaction often
skew toward easy, high-frequency problems. In this work, we explore a radically
different paradigm: assessing models on unsolved questions. Rather than a
static benchmark scored once, we curate unsolved questions and evaluate models
asynchronously over time with validator-assisted screening and community
verification. We introduce UQ, a testbed of 500 challenging, diverse questions
sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi
and history, probing capabilities including reasoning, factuality, and
browsing. UQ is difficult and realistic by construction: unsolved questions are
often hard and naturally arise when humans seek answers, thus solving them
yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset
and its collection pipeline combining rule-based filters, LLM judges, and human
review to ensure question quality (e.g., well-defined and difficult); (2)
UQ-Validators, compound validation strategies that leverage the
generator-validator gap to provide evaluation signals and pre-screen candidate
solutions for human review; and (3) UQ-Platform, an open platform where experts
collectively verify questions and solutions. The top model passes UQ-validation
on only 15% of questions, and preliminary human verification has already
identified correct answers among those that passed. UQ charts a path for
evaluating frontier models on real-world, open-ended challenges, where success
pushes the frontier of human knowledge. We release UQ at
https://uq.stanford.edu.
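
The abstract describes UQ-Validators as compound validation strategies that exploit the generator-validator gap to pre-screen candidate solutions before human review. The sketch below illustrates that general idea in Python; the function names, placeholder validators, and vote threshold are assumptions for illustration only, not the paper's actual implementation.

```python
# A minimal, hypothetical sketch of validator-assisted pre-screening in the
# spirit of UQ-Validators: several independent validators judge a candidate
# answer, and only answers clearing a vote threshold are queued for human
# review. Names and thresholds are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List

Validator = Callable[[str, str], bool]  # (question, candidate_answer) -> accept?


@dataclass
class ScreeningResult:
    question: str
    answer: str
    votes: int
    passed: bool


def compound_validate(
    question: str,
    answer: str,
    validators: List[Validator],
    min_votes: int,
) -> ScreeningResult:
    """Run every validator independently and aggregate their binary verdicts.

    The generator-validator gap is used implicitly: a validator only needs to
    recognize a plausible correct answer, which is often easier than
    producing one from scratch.
    """
    votes = sum(1 for validate in validators if validate(question, answer))
    return ScreeningResult(question, answer, votes, votes >= min_votes)


if __name__ == "__main__":
    # Placeholder validators; in practice each would wrap an LLM judge with a
    # different prompt or check (e.g., self-consistency, fact verification).
    validators: List[Validator] = [
        lambda q, a: len(a.strip()) > 0,                      # non-empty answer
        lambda q, a: "therefore" in a.lower(),                # has a conclusion marker
        lambda q, a: not a.lower().startswith("i don't know"),
    ]

    result = compound_validate(
        question="Is there a closed form for this recurrence?",
        answer="Unrolling the recurrence gives a geometric sum; therefore ...",
        validators=validators,
        min_votes=2,
    )
    # Only answers that pass pre-screening would be surfaced for human review.
    print(f"votes={result.votes}, queue_for_human_review={result.passed}")
```

In this hedged reading, the aggregation rule (majority vote over independent checks) is one plausible instance of a "compound validation strategy"; the actual UQ-Validators may chain, weight, or otherwise compose judges differently.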