StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

Abstract

The ability to follow instructions across multiple turns is a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks focus predominantly on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions. This structural dependency not only reflects user intent but also establishes a second dimension for instruction-following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction-following benchmark with structural flow modeling. The benchmark defines a structural flow framework comprising six fundamental inter-turn relationships, which both introduces novel structural constraints for model evaluation and serves as a set of generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.
