LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
December 6, 2023
Authors: Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, Miao Liu
cs.AI
Abstract
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize the action frame conditioned on a user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models fail to control the state change of an action within the corresponding egocentric image pixel space. To this end, we finetune a visual large language model (VLLM) via visual instruction tuning to curate enriched action descriptions that address our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from the VLLM as additional conditioning. We validate our proposed model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show prominent improvements over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method.
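
To make the conditioning idea concrete, below is a minimal sketch (not the authors' released code) of how VLLM-derived image and text embeddings could be projected and concatenated with standard prompt tokens to form the context fed to a diffusion model's cross-attention layers. All module names, embedding dimensions, and the simple concatenation-based fusion are illustrative assumptions, not details confirmed by the abstract.

```python
# Hypothetical sketch: fusing VLLM image/text embeddings with CLIP prompt tokens
# as additional conditioning for a diffusion backbone. Dimensions are assumed.
import torch
import torch.nn as nn

class ConditioningFusion(nn.Module):
    """Projects VLLM image/text embeddings into the diffusion model's context
    space and concatenates them with the ordinary text-encoder tokens."""
    def __init__(self, vllm_dim=4096, clip_dim=768, ctx_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(vllm_dim, ctx_dim)  # enriched action description tokens
        self.img_proj = nn.Linear(vllm_dim, ctx_dim)   # VLLM visual tokens
        self.clip_proj = nn.Linear(clip_dim, ctx_dim)  # original prompt tokens

    def forward(self, clip_tokens, vllm_text_tokens, vllm_img_tokens):
        # Concatenate along the token dimension to form one context sequence.
        return torch.cat([
            self.clip_proj(clip_tokens),
            self.text_proj(vllm_text_tokens),
            self.img_proj(vllm_img_tokens),
        ], dim=1)                                      # (B, L_total, ctx_dim)

# Usage with dummy tensors standing in for real encoder outputs.
fusion = ConditioningFusion()
clip_tokens = torch.randn(2, 77, 768)    # prompt encoded by a CLIP-style text encoder
vllm_text = torch.randn(2, 32, 4096)     # enriched action description embeddings
vllm_img = torch.randn(2, 256, 4096)     # VLLM image embeddings
context = fusion(clip_tokens, vllm_text, vllm_img)
print(context.shape)                     # torch.Size([2, 365, 768])
```

The resulting `context` tensor would then be consumed by the cross-attention layers of a latent-diffusion U-Net; the exact injection point and fusion strategy used by LEGO are described in the paper itself.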