LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
December 6, 2023
Authors: Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, Miao Liu
cs.AI
Abstract
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize an action frame conditioned on a user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric datasets lack the detailed annotations that describe the execution of actions. Additionally, diffusion-based image manipulation models fail to control the state change of an action within the pixel space of the corresponding egocentric image. To this end, we finetune a visual large language model (VLLM) via visual instruction tuning to curate enriched action descriptions that address our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from the VLLM as additional conditioning. We validate our proposed model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show prominent improvements over prior image manipulation models in both quantitative and qualitative evaluations. We also conduct detailed ablation studies and analysis to provide insights into our method.
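
The abstract describes conditioning a diffusion-based generator on image and text embeddings produced by the finetuned VLLM. The sketch below is a minimal illustration of that conditioning pattern, not the authors' implementation: the module names, dimensions, and fusion strategy (concatenating the two embedding sequences and attending over them with cross-attention) are assumptions made purely for illustration.

```python
# Minimal sketch (assumed, illustrative only) of a denoiser that takes
# VLLM-derived text and image embeddings as extra conditioning signals.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy latent denoiser: fuses VLLM text/image embeddings into a single
    conditioning sequence and applies cross-attention at one resolution."""
    def __init__(self, latent_dim=4, hidden=256, cond_dim=768):
        super().__init__()
        self.in_proj = nn.Conv2d(latent_dim, hidden, 3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, hidden)   # project VLLM embeddings
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.out_proj = nn.Conv2d(hidden, latent_dim, 3, padding=1)

    def forward(self, noisy_latent, text_emb, image_emb):
        # noisy_latent: (B, 4, H, W); text_emb: (B, T, 768); image_emb: (B, I, 768)
        b, _, h, w = noisy_latent.shape
        x = self.in_proj(noisy_latent)                 # (B, hidden, H, W)
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, hidden)
        cond = self.cond_proj(torch.cat([text_emb, image_emb], dim=1))
        attended, _ = self.attn(tokens, cond, cond)    # cross-attend to conditioning
        x = (tokens + attended).transpose(1, 2).reshape(b, -1, h, w)
        return self.out_proj(x)                        # predicted noise

# Usage with random tensors standing in for real VLLM outputs.
model = ConditionedDenoiser()
noise_pred = model(torch.randn(2, 4, 32, 32),
                   torch.randn(2, 16, 768),           # text embedding from VLLM
                   torch.randn(2, 16, 768))           # image embedding from VLLM
print(noise_pred.shape)  # torch.Size([2, 4, 32, 32])
```

In a full latent-diffusion pipeline this block would stand in for the cross-attention layers of the denoising U-Net; the key point conveyed by the abstract is only that both image and text embeddings from the VLLM enter the generator as conditioning, in addition to the input egocentric frame.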