LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
December 6, 2023
Authors: Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, Miao Liu
cs.AI
Abstract
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize the action frame conditioned on a user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models fail to control the state change of an action within the corresponding egocentric image pixel space. To this end, we finetune a visual large language model (VLLM) via visual instruction tuning to curate enriched action descriptions that address our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from the VLLM as additional conditioning. We validate our proposed model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show prominent improvements over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method.
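
To make the conditioning idea concrete, below is a minimal sketch (not the authors' released code) of how VLLM-derived image and text embeddings could be projected and concatenated with standard prompt tokens to form the context fed to a diffusion model's cross-attention layers. All module names, embedding dimensions, and the simple concatenation-based fusion are illustrative assumptions, not details confirmed by the abstract.

```python
# Hypothetical sketch: fusing VLLM image/text embeddings with CLIP prompt tokens
# as additional conditioning for a diffusion backbone. Dimensions are assumed.
import torch
import torch.nn as nn

class ConditioningFusion(nn.Module):
    """Projects VLLM image/text embeddings into the diffusion model's context
    space and concatenates them with the ordinary text-encoder tokens."""
    def __init__(self, vllm_dim=4096, clip_dim=768, ctx_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(vllm_dim, ctx_dim)  # enriched action description tokens
        self.img_proj = nn.Linear(vllm_dim, ctx_dim)   # VLLM visual tokens
        self.clip_proj = nn.Linear(clip_dim, ctx_dim)  # original prompt tokens

    def forward(self, clip_tokens, vllm_text_tokens, vllm_img_tokens):
        # Concatenate along the token dimension to form one context sequence.
        return torch.cat([
            self.clip_proj(clip_tokens),
            self.text_proj(vllm_text_tokens),
            self.img_proj(vllm_img_tokens),
        ], dim=1)                                      # (B, L_total, ctx_dim)

# Usage with dummy tensors standing in for real encoder outputs.
fusion = ConditioningFusion()
clip_tokens = torch.randn(2, 77, 768)    # prompt encoded by a CLIP-style text encoder
vllm_text = torch.randn(2, 32, 4096)     # enriched action description embeddings
vllm_img = torch.randn(2, 256, 4096)     # VLLM image embeddings
context = fusion(clip_tokens, vllm_text, vllm_img)
print(context.shape)                     # torch.Size([2, 365, 768])
```

The resulting `context` tensor would then be consumed by the cross-attention layers of a latent-diffusion U-Net; the exact injection point and fusion strategy used by LEGO are described in the paper itself.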