UQ: Assessing Language Models on Unsolved Questions
August 25, 2025
Authors: Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff
cs.AI
Abstract
Benchmarks shape progress in AI research. A useful benchmark should be both
difficult and realistic: questions should challenge frontier models while also
reflecting real-world usage. Yet, current paradigms face a difficulty-realism
tension: exam-style benchmarks are often made artificially difficult with
limited real-world value, while benchmarks based on real user interaction often
skew toward easy, high-frequency problems. In this work, we explore a radically
different paradigm: assessing models on unsolved questions. Rather than a
static benchmark scored once, we curate unsolved questions and evaluate models
asynchronously over time with validator-assisted screening and community
verification. We introduce UQ, a testbed of 500 challenging, diverse questions
sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi
and history, probing capabilities including reasoning, factuality, and
browsing. UQ is difficult and realistic by construction: unsolved questions are
often hard and naturally arise when humans seek answers, thus solving them
yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset
and its collection pipeline combining rule-based filters, LLM judges, and human
review to ensure question quality (e.g., well-defined and difficult); (2)
UQ-Validators, compound validation strategies that leverage the
generator-validator gap to provide evaluation signals and pre-screen candidate
solutions for human review; and (3) UQ-Platform, an open platform where experts
collectively verify questions and solutions. The top model passes UQ-validation
on only 15% of questions, and preliminary human verification has already
identified correct answers among those that passed. UQ charts a path for
evaluating frontier models on real-world, open-ended challenges, where success
pushes the frontier of human knowledge. We release UQ at
https://uq.stanford.edu.
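
The abstract describes UQ-Validators as compound validation strategies that exploit the generator-validator gap to pre-screen candidate solutions before human review. The sketch below illustrates that general idea in Python; the function names, placeholder validators, and vote threshold are assumptions for illustration only, not the paper's actual implementation.

```python
# A minimal, hypothetical sketch of validator-assisted pre-screening in the
# spirit of UQ-Validators: several independent validators judge a candidate
# answer, and only answers clearing a vote threshold are queued for human
# review. Names and thresholds are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List

Validator = Callable[[str, str], bool]  # (question, candidate_answer) -> accept?


@dataclass
class ScreeningResult:
    question: str
    answer: str
    votes: int
    passed: bool


def compound_validate(
    question: str,
    answer: str,
    validators: List[Validator],
    min_votes: int,
) -> ScreeningResult:
    """Run every validator independently and aggregate their binary verdicts.

    The generator-validator gap is used implicitly: a validator only needs to
    recognize a plausible correct answer, which is often easier than
    producing one from scratch.
    """
    votes = sum(1 for validate in validators if validate(question, answer))
    return ScreeningResult(question, answer, votes, votes >= min_votes)


if __name__ == "__main__":
    # Placeholder validators; in practice each would wrap an LLM judge with a
    # different prompt or check (e.g., self-consistency, fact verification).
    validators: List[Validator] = [
        lambda q, a: len(a.strip()) > 0,                      # non-empty answer
        lambda q, a: "therefore" in a.lower(),                # has a conclusion marker
        lambda q, a: not a.lower().startswith("i don't know"),
    ]

    result = compound_validate(
        question="Is there a closed form for this recurrence?",
        answer="Unrolling the recurrence gives a geometric sum; therefore ...",
        validators=validators,
        min_votes=2,
    )
    # Only answers that pass pre-screening would be surfaced for human review.
    print(f"votes={result.votes}, queue_for_human_review={result.passed}")
```

In this hedged reading, the aggregation rule (majority vote over independent checks) is one plausible instance of a "compound validation strategy"; the actual UQ-Validators may chain, weight, or otherwise compose judges differently.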