Meta-Transformer: A Unified Framework for Multimodal Learning

July 20, 2023
Authors: Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue
cs.AI

Abstract

Multimodal learning aims to build models that can process and relate information from multiple modalities. Despite years of development in this field, it remains challenging to design a unified network for processing various modalities (e.g., natural language, 2D images, 3D point clouds, audio, video, time series, and tabular data) due to the inherent gaps among them. In this work, we propose a framework, named Meta-Transformer, that leverages a frozen encoder to perform multimodal perception without any paired multimodal training data. In Meta-Transformer, the raw input data from various modalities are mapped into a shared token space, allowing a subsequent encoder with frozen parameters to extract high-level semantic features of the input data. Composed of three main components (a unified data tokenizer, a modality-shared encoder, and task-specific heads for downstream tasks), Meta-Transformer is the first framework to perform unified learning across 12 modalities with unpaired data. Experiments on different benchmarks reveal that Meta-Transformer can handle a wide range of tasks, including fundamental perception (text, image, point cloud, audio, video), practical applications (X-ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series data). Meta-Transformer indicates a promising future for developing unified multimodal intelligence with transformers. Code will be available at https://github.com/invictus717/MetaTransformer.
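
The three-component pipeline the abstract describes can be summarized in a short sketch. The following is a minimal, hypothetical PyTorch illustration of the idea (per-modality tokenizers mapping into a shared token space, a frozen modality-shared encoder, and a trainable task head); all module names, dimensions, and the choice of two example modalities are assumptions for illustration, not the authors' implementation (see the GitHub repository for the actual code).

```python
import torch
import torch.nn as nn

class MetaTransformerSketch(nn.Module):
    """Illustrative three-part pipeline: per-modality tokenizers map raw
    inputs into a shared token space, a frozen modality-shared Transformer
    encoder extracts semantic features, and a task-specific head is trained
    on top. Names and dimensions are hypothetical."""

    def __init__(self, embed_dim=768, num_classes=10):
        super().__init__()
        # Unified data tokenizers: one lightweight projector per modality,
        # each emitting tokens of the same width (embed_dim).
        self.tokenizers = nn.ModuleDict({
            # 16x16 patch embedding for RGB images
            "image": nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),
            # linear projection for, e.g., 64-dim time-series windows
            "timeseries": nn.Linear(64, embed_dim),
        })
        # Modality-shared encoder, kept frozen (no gradient updates).
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=12, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Task-specific head; only tokenizers and heads are trained.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x, modality):
        if modality == "image":
            # (B, 3, H, W) -> (B, num_patches, embed_dim)
            tokens = self.tokenizers["image"](x).flatten(2).transpose(1, 2)
        else:
            # (B, seq_len, 64) -> (B, seq_len, embed_dim)
            tokens = self.tokenizers["timeseries"](x)
        feats = self.encoder(tokens)           # frozen shared encoder
        return self.head(feats.mean(dim=1))    # mean-pool tokens, classify

model = MetaTransformerSketch()
logits = model(torch.randn(2, 3, 224, 224), modality="image")
print(logits.shape)  # torch.Size([2, 10])
```

The design choice mirrored here is that only the per-modality tokenizers and the task heads receive gradients, while the shared encoder stays frozen, which is what lets each modality be trained without paired multimodal data.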