

NetPress: Dynamically Generated LLM Benchmarks for Network Applications

June 3, 2025
作者: Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu
cs.AI

Abstract

Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
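The abstract does not spell out the framework's interface, but the "state and action" abstraction it describes can be illustrated with a minimal, hypothetical sketch: each application exposes a network state, each action perturbs that state and phrases the corresponding operator task, and the perturbed state doubles as the ground truth, so arbitrarily many query/answer pairs can be sampled at benchmark time. All names below (Query, make_initial_state, generate_benchmark, etc.) are illustrative assumptions, not NetPress's actual API; see the repository linked above for the real implementation.

```python
# Hypothetical sketch of dynamic query generation from a state/action
# abstraction, in the spirit of the NetPress description. Names and data
# structures here are assumptions for illustration only.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Query:
    prompt: str         # natural-language task given to the LLM agent
    ground_truth: dict  # expected network state after a correct execution

# A toy "state": interface name -> administratively up?
def make_initial_state(num_ifaces: int = 4) -> dict:
    return {f"eth{i}": True for i in range(num_ifaces)}

# Each "action" perturbs the state and phrases the matching operator task.
def shutdown_interface(state: dict, rng: random.Random) -> Query:
    iface = rng.choice(sorted(state))
    expected = dict(state, **{iface: False})
    return Query(prompt=f"Shut down interface {iface}.", ground_truth=expected)

def bring_up_interface(state: dict, rng: random.Random) -> Query:
    iface = rng.choice(sorted(state))
    expected = dict(state, **{iface: True})
    return Query(prompt=f"Ensure interface {iface} is up.", ground_truth=expected)

ACTIONS: list[Callable[[dict, random.Random], Query]] = [
    shutdown_interface,
    bring_up_interface,
]

def generate_benchmark(n_queries: int, seed: int = 0) -> list[Query]:
    """Dynamically generate n_queries query/ground-truth pairs."""
    rng = random.Random(seed)
    state = make_initial_state()
    return [rng.choice(ACTIONS)(state, rng) for _ in range(n_queries)]

if __name__ == "__main__":
    for q in generate_benchmark(3):
        print(q.prompt, "->", q.ground_truth)
```

Because queries are sampled rather than hand-written, scaling to the "millions of queries" mentioned in the abstract is a matter of the benchmark configuration (number of samples, seed, action mix); in the real system, the agent's answer would additionally be replayed against a network emulator to score correctness, safety, and latency.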