StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Abstract

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, which produce structured output from natural language prompts, and 2) conversion tasks, which translate between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models such as o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find that generation tasks are more challenging than conversion tasks, and that producing correct visual content is more difficult than generating text-only structures.
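The abstract does not say how format adherence is measured or how conversion tasks are scored. As a purely illustrative sketch, and not StructEval's actual implementation, the Python below shows one way a syntax-level adherence check for JSON and a toy JSON-to-CSV conversion could look. The function names `parses_as_json` and `json_records_to_csv` are hypothetical, and the conversion assumes the input is a JSON array of flat objects that share the same keys.

```python
import csv
import io
import json


def parses_as_json(text: str) -> bool:
    """Hypothetical adherence check: True if text is syntactically valid JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def json_records_to_csv(text: str) -> str:
    """Toy conversion task: JSON array of flat objects -> CSV text.

    Assumes every object has the same keys as the first one.
    """
    records = json.loads(text)
    fieldnames = list(records[0].keys())
    buffer = io.StringIO()
    # Use "\n" so the output does not depend on the csv module's default "\r\n".
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    model_output = '[{"name": "x", "score": 1}, {"name": "y", "score": 2}]'
    print(parses_as_json(model_output))
    print(json_records_to_csv(model_output))
```

A check like this only establishes that the output parses. Structural correctness, in the sense the abstract uses, would also require comparing keys, nesting, and values against what the task asked for.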
