Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

Abstract

Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of current LLMs to generate complex structured data and propose a structure-aware fine-tuning approach to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, which covers five representative LLMs, including GPT-NeoX 20B, GPT-3.5, GPT-4, and Vicuna, and evaluates them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we use FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming the other evaluated LLMs. Based on these results, we present an ability map of model capabilities across six dimensions: coverage, formatting, reasoning, comprehension, pragmatics, and hallucination. This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.

