UQ: Assessing Language Models on Unsolved Questions

Abstract

Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet current paradigms face a tension between difficulty and realism: exam-style benchmarks are often made artificially difficult and offer limited real-world value, while benchmarks based on real user interactions tend to skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, and probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and arise naturally when humans seek answers, so solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline, which combines rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
