Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

Abstract

The agency expected of agentic Large Language Models (LLMs) goes beyond answering correctly: it requires the autonomy to set goals and decide what to explore. We term this *investigatory intelligence*, distinguishing it from *executional intelligence*, which merely completes assigned tasks. Data science provides a natural testbed, since real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this gap, we introduce Deep Data Research (DDR), an open-ended task in which LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or mere scaling, but also on the intrinsic strategies of agentic models.
