StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, which produce structured output from natural language prompts, and 2) conversion tasks, which translate between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even a state-of-the-art model such as o1-mini achieves an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
