NetPress: Dynamically Generated LLM Benchmarks for Network Applications

Abstract

Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks such as network operations, where deployments demand reliability. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction built on state and action, which enables dynamic generation of diverse query sets together with their corresponding ground truths. At runtime, users specify a benchmark configuration to generate millions of queries on the fly. Beyond dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications and find fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
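To make the state-and-action idea concrete, the following is a minimal sketch of how a query and its ground truth could be derived together from a state and a sampled action. It is an illustration only and does not reflect the actual NetPress API: the toy link-set state, the `Action` class, and the `apply_action` and `generate_queries` functions are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action on a toy network state (not a NetPress type)."""
    kind: str  # "add_link" or "remove_link"
    a: str
    b: str


def apply_action(state, action):
    """Return the new state (a frozenset of links) after applying the action."""
    link = tuple(sorted((action.a, action.b)))
    if action.kind == "add_link":
        return state | {link}
    if action.kind == "remove_link":
        return state - {link}
    raise ValueError(f"unknown action kind: {action.kind}")


def generate_queries(nodes, initial_links, n, seed=0):
    """Generate n (query, ground truth) pairs by sampling actions on the state.

    The ground truth is computed by applying the action to the state,
    so no hand labeling is needed. Assumed toy scheme, for illustration.
    """
    rng = random.Random(seed)  # seeded so the benchmark is reproducible
    state = frozenset(tuple(sorted(link)) for link in initial_links)
    queries = []
    for _ in range(n):
        a, b = rng.sample(nodes, 2)
        link = tuple(sorted((a, b)))
        # Toggle the link so every sampled action changes the state.
        kind = "remove_link" if link in state else "add_link"
        action = Action(kind, link[0], link[1])
        truth = apply_action(state, action)
        verb = "Remove" if kind == "remove_link" else "Add"
        queries.append({
            "query": f"{verb} the link between {link[0]} and {link[1]}.",
            "state": state,
            "ground_truth": truth,
        })
        state = truth  # the next query starts from the updated state
    return queries


if __name__ == "__main__":
    for q in generate_queries(["r1", "r2", "r3"], [("r1", "r2")], n=3, seed=1):
        print(q["query"], sorted(q["ground_truth"]))
```

Because each ground truth is computed from the same state and action that produced the query text, changing `n`, the node list, or the seed yields a fresh labeled query set, which is the property the abstract attributes to the abstraction. A real benchmark would presumably use richer states and validate agent output against an emulator rather than a set of tuples.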