MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition
May 12, 2023
Authors: Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, Rakesh Ranjan
cs.AI
Abstract
In this paper, we study a novel problem in egocentric action recognition,
which we term "Multimodal Generalization" (MMG). MMG aims to study how
systems can generalize when data from certain modalities is limited or even
completely missing. We thoroughly investigate MMG in the context of standard
supervised action recognition and the more challenging few-shot setting for
learning new action categories. MMG consists of two novel scenarios, designed
to support security and efficiency considerations in real-world applications:
(1) missing modality generalization, where some modalities that were present
during training are missing at inference time, and (2)
cross-modal zero-shot generalization, where the modalities present at
inference time and at training time are disjoint. To enable this
investigation, we construct a new dataset MMG-Ego4D containing data points with
video, audio, and inertial motion sensor (IMU) modalities. Our dataset is
derived from the Ego4D dataset, but processed and thoroughly re-annotated by human
experts to facilitate research in the MMG problem. We evaluate a diverse array
of models on MMG-Ego4D and propose new methods with improved generalization
ability. In particular, we introduce a new fusion module with modality dropout
training, contrastive-based alignment training, and a novel cross-modal
prototypical loss for better few-shot performance. We hope this study will
serve as a benchmark and guide future research in multimodal generalization
problems. The benchmark and code will be available at
https://github.com/facebookresearch/MMG_Ego4D.
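The abstract names two of the proposed training ingredients: modality dropout in the fusion module and a cross-modal prototypical loss for few-shot learning. The PyTorch sketch below illustrates these two ideas only as a reading aid; the class and function names, feature dimensions, and the transformer-based fusion backbone are illustrative assumptions, not the authors' implementation from the MMG_Ego4D repository.

```python
# Minimal sketch (not the authors' code) of two ideas named in the abstract:
# a fusion module trained with modality dropout, and a cross-modal
# prototypical loss for few-shot evaluation. All names and dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionWithModalityDropout(nn.Module):
    """Fuses per-modality features (video/audio/IMU) into one embedding."""

    def __init__(self, dim: int = 256, num_classes: int = 10, drop_p: float = 0.5):
        super().__init__()
        self.drop_p = drop_p
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats: dict) -> torch.Tensor:
        # feats maps modality name -> (B, dim) features; any subset may be
        # present at inference time (missing-modality scenario).
        names = list(feats)
        if self.training and len(names) > 1:
            # Modality dropout: randomly discard whole modalities (keep >= 1)
            # so the classifier cannot over-rely on any single input stream.
            keep = [n for n in names if torch.rand(()).item() > self.drop_p]
            if not keep:
                keep = [names[int(torch.randint(len(names), ()))]]
            feats = {n: feats[n] for n in keep}
        tokens = torch.stack([feats[n] for n in feats], dim=1)  # (B, M, dim)
        fused = self.fusion(tokens).mean(dim=1)                 # (B, dim)
        return self.head(fused)


def cross_modal_prototypical_loss(support, support_labels, query, query_labels):
    """Prototypical-network loss where support and query features come from
    different modalities, encouraging a modality-agnostic embedding space."""
    classes = support_labels.unique()
    protos = torch.stack([support[support_labels == c].mean(0) for c in classes])
    logits = -torch.cdist(query, protos)  # closer prototype -> larger logit
    targets = (query_labels.unsqueeze(1) == classes.unsqueeze(0)).float().argmax(1)
    return F.cross_entropy(logits, targets)
```

At inference, such a fusion module can accept whatever subset of modalities is available, which corresponds to the missing-modality setting the benchmark evaluates.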