MathNet: A Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Abstract

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems, together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models are still challenged by the benchmark (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5), while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and the benchmark at https://mathnet.mit.edu.
