HEMM: Holistic Evaluation of Multimodal Foundation Models

Abstract

Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, given the range of possible modeling decisions, tasks, and domains, it is challenging to characterize and study progress in these models. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across three dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are the internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and handling external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flow, and use cases) that pose challenges to today's models, and (2) distill performance trends showing how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, and pre-training and instruction tuning objectives) influence performance. Our conclusions regarding challenging multimodal interactions, use cases and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impact of instruction tuning yield actionable insights for future work on multimodal foundation models.

