

StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

February 20, 2025
Authors: Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu
cs.AI

Abstract
Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions. This structural dependency not only reflects user intent but also establishes a second dimension for instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark innovatively defines a structural flow framework comprising six fundamental inter-turn relationships, which not only introduces novel structural constraints for model evaluation but also serves as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.
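The abstract's central idea, a dialogue modeled as a "structural flow" of turns linked by typed inter-turn relationships, can be sketched minimally in Python. The relation names and the `validate_flow` check below are illustrative assumptions, not the paper's actual taxonomy or evaluation code:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class InterTurnRelation(Enum):
    """Hypothetical inter-turn relation types. The paper defines six
    fundamental relationships, but the abstract does not name them."""
    FOLLOW_UP = auto()
    REFINEMENT = auto()
    EXPANSION = auto()
    SUMMARY = auto()
    RECALL = auto()
    UNRELATED = auto()

@dataclass
class Turn:
    instruction: str
    # How this turn relates to an earlier turn (None for the opening turn).
    relation: Optional[InterTurnRelation] = None
    # Index of the earlier turn this one depends on (None for the opening turn).
    depends_on: Optional[int] = None

def validate_flow(turns: List[Turn]) -> bool:
    """Check basic well-formedness of a structural flow: the opening turn
    has no dependency, and every later turn points back at an earlier turn."""
    for i, turn in enumerate(turns):
        if i == 0:
            if turn.relation is not None or turn.depends_on is not None:
                return False
        else:
            if turn.relation is None or turn.depends_on is None:
                return False
            if not (0 <= turn.depends_on < i):
                return False
    return True

# A small three-turn flow: an opening request, a refinement of it,
# and a summary that depends on the refined result.
flow = [
    Turn("Write a short poem about autumn."),
    Turn("Make it rhyme.", InterTurnRelation.REFINEMENT, 0),
    Turn("Summarize the poem in one sentence.", InterTurnRelation.SUMMARY, 1),
]
print(validate_flow(flow))
```

In this framing, a benchmark generator could sample such relation sequences as parameters to create scenario-specific dialogues, and an evaluator could score a model on whether its responses respect each turn's declared dependency.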
