Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
September 16, 2023
Authors: Xiangru Tang, Yiming Zong, Yilun Zhao, Arman Cohan, Mark Gerstein
cs.AI
Abstract
Despite the power of Large Language Models (LLMs) like GPT-4, they still
struggle with tasks that require generating complex, structured outputs. In
this study, we assess the capability of current LLMs in generating complex
structured data and propose a structure-aware fine-tuning approach as a
solution to improve this ability. To perform a comprehensive evaluation, we
propose Struc-Bench, which includes five representative LLMs (i.e., GPT-NeoX 20B,
GPT-3.5, GPT-4, and Vicuna) and evaluate them on our carefully constructed
datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of
current model performance, we identify specific common formatting errors and
areas of potential improvement. To address complex formatting requirements, we
utilize FormatCoT (Chain-of-Thought) to generate format instructions from
target outputs. Our experiments show that our structure-aware fine-tuning
method, when applied to LLaMA-7B, significantly improves adherence to natural
language constraints, outperforming other evaluated LLMs. Based on these
results, we present an ability map of model capabilities from six dimensions
(i.e., coverage, formatting, reasoning, comprehension, pragmatics, and
hallucination). This map highlights the weaknesses of LLMs in handling complex
structured outputs and suggests promising directions for future work. Our code
and models can be found at https://github.com/gersteinlab/Struc-Bench.
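The abstract states that FormatCoT generates format instructions from target outputs, but does not give the prompt itself. The sketch below illustrates one plausible shape for such a prompt builder; the wording and the function name `build_formatcot_prompt` are assumptions for illustration, not the paper's actual implementation (which is in the linked repository).

```python
# Minimal sketch of a FormatCoT-style prompt builder: given a target
# structured output (e.g., a LaTeX table), ask a model to reason step by
# step about its format and emit explicit format instructions.
# Prompt wording here is an assumption, not taken from the paper.

def build_formatcot_prompt(target_output: str) -> str:
    """Compose a chain-of-thought prompt asking an LLM to infer
    explicit format instructions from a target structured output."""
    return (
        "Below is a target output.\n"
        "Describe, step by step, the exact format it follows "
        "(delimiters, row/column structure, markup), then state the "
        "format instructions a model should follow to reproduce it.\n\n"
        f"Target output:\n{target_output}\n\n"
        "Format instructions:"
    )

# Example target: a tiny LaTeX table.
latex_table = (
    "\\begin{tabular}{ll}\n"
    "Name & Score \\\\\n"
    "A & 1 \\\\\n"
    "\\end{tabular}"
)
prompt = build_formatcot_prompt(latex_table)
```

The resulting prompt string would then be sent to an LLM; its response (the inferred format instructions) can be paired with the source text as additional supervision during structure-aware fine-tuning.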