

HEMM: Holistic Evaluation of Multimodal Foundation Models

July 3, 2024
作者: Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, Louis-Philippe Morency
cs.AI

Abstract

Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance. Our conclusions regarding challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impacts of instruction tuning yield actionable insights for future work in multimodal foundation models.
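The abstract describes grouping the 30 tasks by three dataset dimensions (basic skills, information flow, use cases) and averaging model performance within each group to find which dimensions are most challenging. As a minimal illustration of that kind of aggregation, the sketch below uses entirely hypothetical task records and scores; the field names and values are assumptions for demonstration, not data from the HEMM benchmark itself.

```python
from collections import defaultdict

# Hypothetical task records tagged with HEMM-style dimensions.
# Task names, tags, and scores are illustrative only.
results = [
    {"task": "vqa-style", "skill": "fine-grained alignment",
     "flow": "querying", "use_case": "multimedia", "score": 0.72},
    {"task": "captioning", "skill": "cross-modal interactions",
     "flow": "translation", "use_case": "multimedia", "score": 0.65},
    {"task": "med-imaging", "skill": "external knowledge",
     "flow": "fusion", "use_case": "healthcare", "score": 0.41},
]

def aggregate(results, key):
    """Average model score for each value of one dataset dimension."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r[key]].append(r["score"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Aggregating by use case surfaces which domains lag behind.
print(aggregate(results, "use_case"))
```

The same `aggregate` call with `"skill"` or `"flow"` as the key would produce the per-skill and per-information-flow breakdowns the paper's analysis is organized around.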
