LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

Abstract

Generating instructional images of human daily actions from an egocentric viewpoint is a key step towards efficient skill transfer. In this paper, we introduce a novel problem: egocentric action frame generation. The goal is to synthesize the action frame conditioned on a user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric datasets lack detailed annotations that describe the execution of actions. Additionally, diffusion-based image manipulation models fail to control the state change of an action within the corresponding egocentric image pixel space. To this end, we finetune a visual large language model (VLLM) via visual instruction tuning to curate enriched action descriptions for our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from the VLLM as additional conditioning. We validate our proposed model on two egocentric datasets, Ego4D and Epic-Kitchens. Our experiments show prominent improvement over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights into our method.
