MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
July 25, 2025
Authors: Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
cs.AI
Abstract
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form contexts, or lack human annotations -- hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual, human-annotated benchmark based on scientific talks designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' ability to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.