NetPress: Dynamically Generated LLM Benchmarks for Network Applications

Abstract

Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks such as network operations, where deployments demand reliability. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with their corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
