Multimodal Evaluation of Russian-language Architectures

November 19, 2025
Authors: Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova
cs.AI

Abstract

Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.