MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
July 25, 2025
Authors: Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
cs.AI
Abstract
Recent advances in large language models have catalyzed the development of
multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified
frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to
general-purpose instruction-following models, a key frontier lies in evaluating
their multilingual and multimodal capabilities over both long and short
contexts. However, existing benchmarks fall short in evaluating these
dimensions jointly: they are often limited to English, mostly focus on a
single modality at a time, rely on short-form contexts, or lack human
annotations -- hindering comprehensive assessment of model performance across
languages, modalities, and task complexity. To address these gaps, we introduce
MCIF (Multimodal Crosslingual Instruction Following), the first multilingual
human-annotated benchmark based on scientific talks that is designed to
evaluate instruction-following in crosslingual, multimodal settings over both
short- and long-form inputs. MCIF spans three core modalities -- speech,
vision, and text -- and four diverse languages (English, German, Italian, and
Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret
instructions across languages and combine them with multimodal contextual
information. MCIF is released under a CC-BY 4.0 license to encourage open
research and progress in MLLM development.