

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

April 12, 2026
Authors: Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, Jindong Wang
cs.AI

Abstract

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.
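To make the idea of a unified interface with standardized evaluation protocols concrete, below is a minimal Python sketch of how an adapter layer over heterogeneous UMM backbones might be organized around the three task dimensions (understanding, generation, editing). All names in it (Sample, UnifiedModel, understand, generate, edit, evaluate) are illustrative assumptions, not the actual TorchUMM API; consult the repository for the real interface.

# Hypothetical sketch only: these class and method names are NOT the
# TorchUMM API. They illustrate one way a unified adapter over diverse
# UMM backbones and a standardized evaluation loop could be structured.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    """One benchmark example: an instruction plus optional image input."""
    task: str                      # "understanding" | "generation" | "editing"
    instruction: str
    image: bytes | None = None     # raw image payload, if the task needs one
    reference: str | None = None   # ground truth used for scoring


class UnifiedModel(ABC):
    """Adapter each backbone implements once, so every model is queried
    through the same three entry points regardless of its architecture."""

    @abstractmethod
    def understand(self, image: bytes, instruction: str) -> str: ...

    @abstractmethod
    def generate(self, instruction: str) -> bytes: ...

    @abstractmethod
    def edit(self, image: bytes, instruction: str) -> bytes: ...


def evaluate(model: UnifiedModel, samples: Iterable[Sample]) -> dict[str, float]:
    """Standardized protocol: route each sample to the matching capability
    and aggregate per-task accuracy (exact match as a stand-in metric)."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for s in samples:
        total[s.task] = total.get(s.task, 0) + 1
        if s.task == "understanding":
            pred = model.understand(s.image, s.instruction)
            if s.reference is not None and pred.strip() == s.reference.strip():
                correct[s.task] = correct.get(s.task, 0) + 1
        # generation / editing would dispatch to task-specific scorers
        # (e.g., image-quality or instruction-faithfulness metrics) instead.
    return {t: correct.get(t, 0) / n for t, n in total.items()}

The point of such a layer is that the evaluation loop never touches backbone-specific details, which is what makes comparisons across heterogeneous models fair and reproducible.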