MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Abstract

Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience than traditional half-duplex models. However, existing benchmarks primarily evaluate single-round interactions and neglect the complexities of multi-round communication. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. In addition, existing benchmarks often focus solely on conversational features and neglect other critical aspects. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark designed for comprehensive multi-round evaluation of FD-SLMs. MTR-DuplexBench not only segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment but also incorporates multiple evaluation aspects, including conversational features, dialogue quality, instruction following, and safety. Experimental results reveal that current FD-SLMs have difficulty maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our benchmark. Code and data are available at: https://github.com/ZhangHe0918/MTR-DuplexBench
