NetPress: Dynamically Generated LLM Benchmarks for Network Applications

June 3, 2025
作者: Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu
cs.AI

Abstract

Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations, which demand reliability in deployment. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
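The core mechanism described here, a unified state-and-action abstraction from which queries and their ground truths are generated on the fly, can be sketched in miniature. The following is an illustrative assumption only: the names NetworkState, Action, bring_down, and generate_queries are hypothetical and do not reflect NetPress's actual API; see the linked repository for the real implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a state/action benchmark generator.
# All names below are illustrative assumptions, not NetPress's API.

@dataclass
class NetworkState:
    """A snapshot of the emulated network: interface name -> "up" | "down"."""
    links: dict

@dataclass
class Action:
    """A state transition with a known effect, so ground truth is derivable."""
    name: str
    apply: callable  # NetworkState -> NetworkState

def bring_down(iface):
    """Build an action that marks one interface as down."""
    def _apply(state):
        new_links = dict(state.links)
        new_links[iface] = "down"
        return NetworkState(links=new_links)
    return Action(name=f"bring_down({iface})", apply=_apply)

def generate_queries(state, actions, n):
    """Sample actions at runtime and emit (query, ground_truth) pairs."""
    for _ in range(n):
        action = random.choice(actions)
        next_state = action.apply(state)
        query = f"After {action.name}, which links are down?"
        ground_truth = sorted(i for i, s in next_state.links.items() if s == "down")
        yield query, ground_truth

if __name__ == "__main__":
    init = NetworkState(links={"eth0": "up", "eth1": "up"})
    actions = [bring_down("eth0"), bring_down("eth1")]
    for q, gt in generate_queries(init, actions, n=3):
        print(q, "->", gt)
```

Because each action's effect on the state is known in advance, the ground truth for every generated query falls out mechanically rather than being hand-labeled, which is what makes generating millions of queries at runtime feasible.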