HEMM: Holistic Evaluation of Multimodal Foundation Models
July 3, 2024
Authors: Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, Louis-Philippe Morency
cs.AI
Abstract
Multimodal foundation models that can holistically process text alongside
images, video, audio, and other sensory modalities are increasingly used in a
variety of real-world applications. However, it is challenging to characterize
and study progress in multimodal foundation models, given the range of possible
modeling decisions, tasks, and domains. In this paper, we introduce Holistic
Evaluation of Multimodal Models (HEMM) to systematically evaluate the
capabilities of multimodal foundation models across a set of 3 dimensions:
basic skills, information flow, and real-world use cases. Basic multimodal
skills are internal abilities required to solve problems, such as learning
interactions across modalities, fine-grained alignment, multi-step reasoning,
and the ability to handle external knowledge. Information flow studies how
multimodal content changes during a task through querying, translation,
editing, and fusion. Use cases span domain-specific challenges introduced in
real-world multimedia, affective computing, natural sciences, healthcare, and
human-computer interaction applications. Through comprehensive experiments
across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g.,
basic skills, information flows, and use cases) that pose challenges to today's
models, and (2) distill performance trends regarding how different modeling
dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training,
and instruction tuning objectives) influence performance. Our conclusions
regarding challenging multimodal interactions, use cases, and tasks requiring
reasoning and external knowledge, the benefits of data and model scale, and the
impacts of instruction tuning yield actionable insights for future work in
multimodal foundation models.
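To make the three-dimension taxonomy concrete, below is a minimal, hypothetical Python sketch of how per-task scores could be aggregated along HEMM's dimensions (basic skills, information flow, use cases). The class and function names are illustrative assumptions, not the paper's released code; see the official HEMM release for the actual evaluation framework.

```python
# Hypothetical sketch: tag each task with its HEMM dimensions, score a model
# on every task, then average accuracy per dimension value. All names here
# are illustrative; the actual HEMM codebase may differ.
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    skill: str              # e.g., "multi-step reasoning", "fine-grained alignment"
    information_flow: str   # e.g., "querying", "translation", "editing", "fusion"
    use_case: str           # e.g., "healthcare", "affective computing"
    examples: list          # (multimodal input, reference answer) pairs

def evaluate(model: Callable, tasks: list[Task]) -> dict:
    """Compute per-task accuracy, then mean accuracy for each dimension value."""
    by_dim = defaultdict(list)
    for task in tasks:
        correct = sum(model(x) == y for x, y in task.examples)
        acc = correct / len(task.examples)
        # The same task score contributes to all three of its dimension tags.
        by_dim[("skill", task.skill)].append(acc)
        by_dim[("information_flow", task.information_flow)].append(acc)
        by_dim[("use_case", task.use_case)].append(acc)
    return {dim: sum(scores) / len(scores) for dim, scores in by_dim.items()}
```

Grouping scores this way is what lets the paper's analysis report which skills, information flows, and use cases remain challenging for current models, independent of any single task.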