HEMM: Holistic Evaluation of Multimodal Foundation Models

Abstract

Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models along three dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance. Our conclusions regarding challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impact of instruction tuning yield actionable insights for future work in multimodal foundation models.
