StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

Abstract

Multi-turn instruction following is a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks focus predominantly on fine-grained constraint satisfaction and domain-specific capability assessment, yet they overlook the structural dependencies between dialogue turns that distinguish multi-turn from single-turn interactions. These structural dependencies not only reflect user intent but also establish a second dimension for instruction-following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction-following benchmark with structural flow modeling. The benchmark defines a structural flow framework comprising six fundamental inter-turn relationships. The framework introduces novel structural constraints for model evaluation and also serves as a set of generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we systematically evaluate 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.
