
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

May 26, 2025
Authors: Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, Wenhu Chen
cs.AI

Abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, which produce structured output from natural language prompts, and 2) conversion tasks, which translate between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models such as o1-mini achieve an average score of only 75.58, with open-source alternatives lagging roughly 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
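To make the format-adherence idea concrete, below is a minimal sketch (not the authors' implementation) of how adherence could be checked for the non-renderable formats named above: the model output is simply run through the corresponding standard parser, and adherence means it parses cleanly. The function name `adheres_to_format` and the PyYAML dependency are illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of a format-adherence
# check for non-renderable formats: the model output is run through the
# corresponding standard parser, and adherence means it parses cleanly.
import csv
import io
import json

import yaml  # assumed dependency: PyYAML


def adheres_to_format(output: str, fmt: str) -> bool:
    """Return True if `output` parses cleanly as the requested format."""
    if fmt == "json":
        parse = json.loads
    elif fmt == "yaml":
        parse = yaml.safe_load
    elif fmt == "csv":
        # Every non-empty row must have the same number of columns.
        def parse(text: str) -> None:
            widths = {len(row) for row in csv.reader(io.StringIO(text)) if row}
            if len(widths) != 1:
                raise ValueError("inconsistent column counts")
    else:
        raise ValueError(f"unsupported format: {fmt}")

    try:
        parse(output)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    # A generation-task output that should be valid JSON, and one that is not.
    print(adheres_to_format('{"name": "StructEval", "formats": 18}', "json"))  # True
    print(adheres_to_format("name: StructEval\nformats: 18", "json"))          # False
```

Structural correctness would require a deeper comparison against a reference schema or expected keys, and the renderable formats (HTML, React, SVG) would additionally need to be rendered and visually inspected, which this sketch does not attempt.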
