Benchmark Everything, Everywhere, All at Once

Abstract

Benchmarks are fundamental to evaluating and advancing large language models (LLMs) and multimodal large language models (MLLMs) because they provide standardized, explicit measures of performance. However, building them is labor-intensive and the results are hard to reuse, which raises concerns about sustainability and scalability. Moreover, existing benchmarks often reach performance saturation soon after release, so they no longer discriminate well among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system for benchmark construction. Our framework orchestrates the complete construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we use it to produce 15 representative benchmarks spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate that Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, continual evaluation yields several findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.
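The four pipeline stages named in the abstract (user query analysis, subtask design, data annotation, quality control) can be pictured as a chain of functions. The following is only a minimal sketch under stated assumptions, not the Benchmark Agent implementation: every name in it (`Sample`, `BenchmarkDraft`, `build_benchmark`, and the `planner`, `annotator`, and `checker` callables) is hypothetical and chosen for illustration, and the stages that would be agent-driven in practice are passed in as plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sample:
    """One benchmark item (hypothetical schema)."""
    subtask: str
    question: str
    answer: str


@dataclass
class BenchmarkDraft:
    """State carried through the pipeline (hypothetical)."""
    query: str
    subtasks: List[str] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)


def analyze_query(query: str) -> BenchmarkDraft:
    # Stage 1: user query analysis. Here it only trims whitespace;
    # a real system would presumably extract the evaluation target.
    return BenchmarkDraft(query=query.strip())


def design_subtasks(draft: BenchmarkDraft,
                    planner: Callable[[str], List[str]]) -> BenchmarkDraft:
    # Stage 2: subtask design, delegated to a caller-supplied planner.
    draft.subtasks = planner(draft.query)
    return draft


def annotate(draft: BenchmarkDraft,
             annotator: Callable[[str], List[Sample]]) -> BenchmarkDraft:
    # Stage 3: data annotation, one batch of samples per subtask.
    draft.samples = [s for task in draft.subtasks for s in annotator(task)]
    return draft


def quality_control(draft: BenchmarkDraft,
                    checker: Callable[[Sample], bool]) -> BenchmarkDraft:
    # Stage 4: quality control, keeping only samples the checker accepts.
    draft.samples = [s for s in draft.samples if checker(s)]
    return draft


def build_benchmark(query: str,
                    planner: Callable[[str], List[str]],
                    annotator: Callable[[str], List[Sample]],
                    checker: Callable[[Sample], bool]) -> BenchmarkDraft:
    # Run the four stages in the order the abstract lists them.
    draft = analyze_query(query)
    draft = design_subtasks(draft, planner)
    draft = annotate(draft, annotator)
    return quality_control(draft, checker)
```

Passing the stages as callables keeps the sketch independent of any particular model or annotation backend; in this toy form, a caller could plug in stub functions to exercise the control flow before wiring in real components.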