MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
July 25, 2025
Authors: Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
cs.AI
Abstract
Recent advances in large language models have catalyzed the development of
multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified
frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to
general-purpose instruction-following models, a key frontier lies in evaluating
their multilingual and multimodal capabilities over both long and short
contexts. However, existing benchmarks fall short in evaluating these
dimensions jointly: they are often limited to English, mostly focus on a
single modality at a time, rely on short-form contexts, or lack human
annotations -- hindering comprehensive assessment of model performance across
languages, modalities, and task complexity. To address these gaps, we introduce
MCIF (Multimodal Crosslingual Instruction Following), the first multilingual
human-annotated benchmark based on scientific talks that is designed to
evaluate instruction-following in crosslingual, multimodal settings over both
short- and long-form inputs. MCIF spans three core modalities -- speech,
vision, and text -- and four diverse languages (English, German, Italian, and
Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret
instructions across languages and combine them with multimodal contextual
information. MCIF is released under a CC-BY 4.0 license to encourage open
research and progress in MLLM development.