StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, which produce structured output from natural language prompts, and 2) conversion tasks, which translate between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models such as o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
