MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition

Abstract

In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing modality generalization, where some modalities that were present at training time are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at https://github.com/facebookresearch/MMG_Ego4D.
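
To make the idea of modality dropout training concrete, the following is a minimal sketch, not the implementation from the MMG-Ego4D repository. It assumes each sample carries one embedding per modality (video, audio, IMU), that whole modalities are zeroed at random during training while at least one is always kept, and that fusion is a plain average over the kept modalities. The names `modality_dropout` and `fuse`, the dictionary layout, and the `drop_prob` parameter are hypothetical and chosen only for this illustration; the paper's fusion module is learned and may differ substantially.

```python
import random

# Modalities named in the abstract; the dict-of-lists layout is an assumption.
MODALITIES = ("video", "audio", "imu")


def modality_dropout(features, drop_prob=0.5, rng=random):
    """Randomly zero out whole modalities, always keeping at least one.

    features: dict mapping modality name -> list of floats
              (a hypothetical per-modality embedding).
    Returns the masked features and the list of modalities that were kept.
    """
    names = list(features)
    # Keep each modality independently with probability 1 - drop_prob.
    kept = [m for m in names if rng.random() >= drop_prob]
    if not kept:
        # Never drop everything: fall back to one randomly chosen modality.
        kept = [rng.choice(names)]
    masked = {
        m: (list(v) if m in kept else [0.0] * len(v))
        for m, v in features.items()
    }
    return masked, kept


def fuse(features, kept):
    """Average the embeddings of the kept modalities.

    This is a stand-in for a learned fusion module, used here only so the
    example is self-contained.
    """
    dim = len(next(iter(features.values())))
    return [sum(features[m][i] for m in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    sample = {"video": [1.0, 2.0], "audio": [3.0, 4.0], "imu": [5.0, 6.0]}
    masked, kept = modality_dropout(sample, drop_prob=0.5)
    print("kept modalities:", kept)
    print("fused embedding:", fuse(masked, kept))
```

Training on randomly masked inputs of this kind exposes the model to the same situation it faces in the missing modality scenario, where a modality seen during training is unavailable at inference time.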
