MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition
May 12, 2023
Authors: Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, Rakesh Ranjan
cs.AI
Abstract
In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing modality generalization, where some modalities that were present during training are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at https://github.com/facebookresearch/MMG_Ego4D.
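
To make the modality-dropout idea mentioned in the abstract concrete, below is a minimal PyTorch sketch, not the paper's actual fusion module. The class name `FusionHead`, the transformer-based fusion, and parameters such as `embed_dim` and `p_drop` are assumptions made for this illustration. The idea shown is that, during training, whole modality embeddings are randomly replaced by a learned mask token, so the downstream classifier learns to predict from any subset of modalities and can tolerate missing modalities at inference time.

```python
# Illustrative sketch of modality-dropout training for multimodal fusion
# (video + audio + IMU embeddings). Not the authors' implementation.
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Fuses per-modality embeddings with a small transformer encoder.

    During training, entire modalities are randomly dropped (replaced by a
    learned mask token) so the model learns to classify from any subset.
    """

    def __init__(self, embed_dim: int = 256, num_classes: int = 100, p_drop: float = 0.5):
        super().__init__()
        self.p_drop = p_drop
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, video: torch.Tensor, audio: torch.Tensor, imu: torch.Tensor) -> torch.Tensor:
        # Each input: (batch, embed_dim) embedding from a unimodal backbone.
        tokens = torch.stack([video, audio, imu], dim=1)  # (batch, 3, embed_dim)

        if self.training:
            batch, num_mod, _ = tokens.shape
            # Bernoulli mask per sample and modality: True = drop this modality.
            drop = torch.rand(batch, num_mod, 1, device=tokens.device) < self.p_drop
            # Keep all modalities for samples where everything would be dropped.
            drop = drop & ~drop.all(dim=1, keepdim=True)
            tokens = torch.where(drop, self.mask_token.expand_as(tokens), tokens)

        fused = self.fusion(tokens).mean(dim=1)  # (batch, embed_dim)
        return self.classifier(fused)


if __name__ == "__main__":
    head = FusionHead()
    v, a, m = (torch.randn(8, 256) for _ in range(3))
    logits = head(v, a, m)
    print(logits.shape)  # torch.Size([8, 100])
```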