MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition
May 12, 2023
Authors: Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, Rakesh Ranjan
cs.AI
Abstract
In this paper, we study a novel problem in egocentric action recognition,
which we term "Multimodal Generalization" (MMG). MMG aims to study how
systems can generalize when data from certain modalities is limited or even
completely missing. We thoroughly investigate MMG in the context of standard
supervised action recognition and the more challenging few-shot setting for
learning new action categories. MMG consists of two novel scenarios, designed
to support security and efficiency considerations in real-world applications:
(1) missing modality generalization, where some modalities that were present
during training are missing at inference time, and (2)
cross-modal zero-shot generalization, where the modalities present at
inference time and at training time are disjoint. To enable this
investigation, we construct a new dataset MMG-Ego4D containing data points with
video, audio, and inertial motion sensor (IMU) modalities. Our dataset is
derived from the Ego4D dataset, but processed and thoroughly re-annotated by human
experts to facilitate research in the MMG problem. We evaluate a diverse array
of models on MMG-Ego4D and propose new methods with improved generalization
ability. In particular, we introduce a new fusion module with modality dropout
training, contrastive-based alignment training, and a novel cross-modal
prototypical loss for better few-shot performance. We hope this study will
serve as a benchmark and guide future research in multimodal generalization
problems. The benchmark and code will be available at
https://github.com/facebookresearch/MMG_Ego4D.
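The abstract names two of the proposed training ingredients: modality dropout in the fusion module and a cross-modal prototypical loss for few-shot learning. The PyTorch sketch below illustrates these two ideas only as a reading aid; the class and function names, feature dimensions, and the transformer-based fusion backbone are illustrative assumptions, not the authors' implementation from the MMG_Ego4D repository.

```python
# Minimal sketch (not the authors' code) of two ideas named in the abstract:
# a fusion module trained with modality dropout, and a cross-modal
# prototypical loss for few-shot evaluation. All names and dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionWithModalityDropout(nn.Module):
    """Fuses per-modality features (video/audio/IMU) into one embedding."""

    def __init__(self, dim: int = 256, num_classes: int = 10, drop_p: float = 0.5):
        super().__init__()
        self.drop_p = drop_p
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats: dict) -> torch.Tensor:
        # feats maps modality name -> (B, dim) features; any subset may be
        # present at inference time (missing-modality scenario).
        names = list(feats)
        if self.training and len(names) > 1:
            # Modality dropout: randomly discard whole modalities (keep >= 1)
            # so the classifier cannot over-rely on any single input stream.
            keep = [n for n in names if torch.rand(()).item() > self.drop_p]
            if not keep:
                keep = [names[int(torch.randint(len(names), ()))]]
            feats = {n: feats[n] for n in keep}
        tokens = torch.stack([feats[n] for n in feats], dim=1)  # (B, M, dim)
        fused = self.fusion(tokens).mean(dim=1)                 # (B, dim)
        return self.head(fused)


def cross_modal_prototypical_loss(support, support_labels, query, query_labels):
    """Prototypical-network loss where support and query features come from
    different modalities, encouraging a modality-agnostic embedding space."""
    classes = support_labels.unique()
    protos = torch.stack([support[support_labels == c].mean(0) for c in classes])
    logits = -torch.cdist(query, protos)  # closer prototype -> larger logit
    targets = (query_labels.unsqueeze(1) == classes.unsqueeze(0)).float().argmax(1)
    return F.cross_entropy(logits, targets)
```

At inference, such a fusion module can accept whatever subset of modalities is available, which corresponds to the missing-modality setting the benchmark evaluates.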