TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

Abstract

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging because model architectures are diverse and training paradigms and implementation details are heterogeneous. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions (multimodal understanding, generation, and editing) and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models, fosters deeper insight into their strengths and limitations, and facilitates the development of more capable unified multimodal systems. Code is available at https://github.com/AIFrontierLab/TorchUMM.
