
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

May 26, 2025
作者: Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, Wenhu Chen
cs.AI

Abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, which produce structured output from natural language prompts, and 2) conversion tasks, which translate between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models such as o1-mini achieve only an average score of 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
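The abstract does not spell out how the benchmark's scoring works; the following is a minimal, hypothetical sketch of how format adherence and structural correctness could be checked for a non-renderable format such as JSON. The function name, scoring rule, and example data are illustrative assumptions, not StructEval's actual metrics.

```python
import json

def score_json_output(model_output: str, expected_keys: set) -> dict:
    """Hypothetical scoring sketch -- not StructEval's actual metric.

    format_adherence: 1.0 if the output parses as valid JSON, else 0.0.
    structural_correctness: fraction of expected top-level keys present.
    """
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return {"format_adherence": 0.0, "structural_correctness": 0.0}

    if not isinstance(parsed, dict):
        # Valid JSON, but not the object structure the task asked for.
        return {"format_adherence": 1.0, "structural_correctness": 0.0}

    matched = expected_keys & parsed.keys()
    correctness = len(matched) / len(expected_keys) if expected_keys else 1.0
    return {"format_adherence": 1.0, "structural_correctness": correctness}


# Example generation task: "Return a JSON object with name, age, and email."
output = '{"name": "Ada", "age": 36, "languages": ["Python", "C"]}'
print(score_json_output(output, {"name", "age", "email"}))
# {'format_adherence': 1.0, 'structural_correctness': 0.666...}
```

Renderable formats (HTML, React, SVG) would additionally require rendering the output and assessing the visual result, which a text-level check like this does not cover.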
