StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

Abstract

Multi-turn instruction following is a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks focus predominantly on fine-grained constraint satisfaction and domain-specific capability assessment, yet they overlook the structural dependencies between dialogue turns that distinguish multi-turn from single-turn interactions. These structural dependencies not only reflect user intent but also establish a second dimension for instruction-following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction-following benchmark with structural flow modeling. The benchmark defines a structural flow framework comprising six fundamental inter-turn relationships, which both introduces novel structural constraints for model evaluation and serves as a set of generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we systematically evaluate 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.

