MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
July 25, 2025
Authors: Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
cs.AI
Abstract
Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form contexts, or lack human annotations -- hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual, human-annotated benchmark based on scientific talks designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' ability to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.