MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Abstract

Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form contexts, or lack human annotations, which hinders comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks, designed to evaluate instruction following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.
