
TorchUMM:面向评估、分析與訓練後處理的統一多模態模型代碼庫

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

April 12, 2026
Authors: Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, Jindong Wang
cs.AI

Abstract

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.
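Since the abstract highlights a unified interface and standardized evaluation protocols as the mechanism enabling fair comparison, the sketch below illustrates what such an abstraction over heterogeneous backbones could look like. This is a minimal, hypothetical Python sketch: the names `UnifiedMultimodalModel`, `Sample`, and `run_understanding_eval` are illustrative assumptions, not TorchUMM's actual API; consult the repository linked above for the real interface.

```python
# Hypothetical sketch of a "unified interface" for UMM evaluation.
# None of these names come from TorchUMM; they only illustrate how
# heterogeneous backbones could sit behind one common surface.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One benchmark item: an instruction plus an optional input image."""
    instruction: str
    image_path: Optional[str] = None


class UnifiedMultimodalModel(ABC):
    """Common surface each backbone adapter implements, so evaluation
    code never depends on model-specific details."""

    @abstractmethod
    def understand(self, sample: Sample) -> str:
        """Answer an understanding query (e.g., VQA) with text."""

    @abstractmethod
    def generate(self, sample: Sample) -> bytes:
        """Produce image bytes for a text-to-image generation task."""

    @abstractmethod
    def edit(self, sample: Sample) -> bytes:
        """Produce edited image bytes for an instruction-editing task."""


def run_understanding_eval(model: UnifiedMultimodalModel,
                           samples: List[Sample]) -> float:
    """Toy stand-in for a standardized protocol: score the fraction of
    non-empty answers; a real benchmark would plug task metrics in here."""
    answers = [model.understand(s) for s in samples]
    return sum(bool(a.strip()) for a in answers) / max(len(answers), 1)
```

The design point this sketch is meant to convey: benchmark code depends only on the abstract surface, never on a specific backbone, which is what makes reproducible comparison across models of different scales and design paradigms possible.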