MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition
May 12, 2023
Authors: Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, Rakesh Ranjan
cs.AI
Abstract
In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing modality generalization, where some modalities that were present during training are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at https://github.com/facebookresearch/MMG_Ego4D.
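
To make the modality-dropout idea mentioned in the abstract concrete, below is a minimal PyTorch sketch, not the paper's actual fusion module. The class name `FusionHead`, the transformer-based fusion, and parameters such as `embed_dim` and `p_drop` are assumptions made for this illustration. The idea shown is that, during training, whole modality embeddings are randomly replaced by a learned mask token, so the downstream classifier learns to predict from any subset of modalities and can tolerate missing modalities at inference time.

```python
# Illustrative sketch of modality-dropout training for multimodal fusion
# (video + audio + IMU embeddings). Not the authors' implementation.
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Fuses per-modality embeddings with a small transformer encoder.

    During training, entire modalities are randomly dropped (replaced by a
    learned mask token) so the model learns to classify from any subset.
    """

    def __init__(self, embed_dim: int = 256, num_classes: int = 100, p_drop: float = 0.5):
        super().__init__()
        self.p_drop = p_drop
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, video: torch.Tensor, audio: torch.Tensor, imu: torch.Tensor) -> torch.Tensor:
        # Each input: (batch, embed_dim) embedding from a unimodal backbone.
        tokens = torch.stack([video, audio, imu], dim=1)  # (batch, 3, embed_dim)

        if self.training:
            batch, num_mod, _ = tokens.shape
            # Bernoulli mask per sample and modality: True = drop this modality.
            drop = torch.rand(batch, num_mod, 1, device=tokens.device) < self.p_drop
            # Keep all modalities for samples where everything would be dropped.
            drop = drop & ~drop.all(dim=1, keepdim=True)
            tokens = torch.where(drop, self.mask_token.expand_as(tokens), tokens)

        fused = self.fusion(tokens).mean(dim=1)  # (batch, embed_dim)
        return self.classifier(fused)


if __name__ == "__main__":
    head = FusionHead()
    v, a, m = (torch.randn(8, 256) for _ in range(3))
    logits = head(v, a, m)
    print(logits.shape)  # torch.Size([8, 100])
```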