Benchmarks Saturate When The Model Gets Smarter Than The Judge

Abstract

Benchmarks are important tools for tracking progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset (n=4181) and a tagged, non-standard subset (n=247). Each problem was audited to ensure LaTeX compilability, solvability, and verifiability, which involved adding missing figures or information, labeling problems that require a proof, an estimation, or an image, and removing clutter. This process significantly reduces dataset-induced noise and thereby provides a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between the judges on both the clean and the tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in 96.4% of the judge disagreements, indicating its inability to differentiate between models' abilities, even well before the benchmark saturates. As problems become more challenging, we find that increasingly competent judges become essential to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the failure modes present in the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical for developing accurate benchmarks of model performance.
