LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

Abstract

Generating instructional images of human daily actions from an egocentric viewpoint is a key step towards efficient skill transfer. In this paper, we introduce a novel problem: egocentric action frame generation. The goal is to synthesize an action frame conditioned on a user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric datasets lack detailed annotations that describe how actions are executed. Additionally, existing diffusion-based image manipulation models fail to control the state change of an action within the pixel space of the corresponding egocentric image. To address these limitations, we finetune a visual large language model (VLLM) via visual instruction tuning to curate enriched action descriptions. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from the VLLM as additional conditioning. We validate our proposed model on two egocentric datasets, Ego4D and Epic-Kitchens. Our experiments show notable improvements over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method.
