NetPress: Dynamically Generated LLM Benchmarks for Network Applications

Abstract

Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets. The limitation is most pronounced in high-stakes tasks such as network operations, where deployments demand reliability. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction of state and action, which enables dynamic generation of diverse query sets together with their corresponding ground truths. At runtime, users specify benchmark configurations to generate millions of queries on the fly. Beyond dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications and reveal fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
