Visual Autoregressive Modeling for Instruction-Guided Image Editing

Abstract

Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and weaker adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to condition effectively on the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advances in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by more than 30% in GPT-Balance score. Moreover, it completes a 512×512 edit in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
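The idea of scale-matched conditioning can be illustrated with a toy sketch. The abstract does not specify how the SAR module derives its scale-matched references, so the snippet below makes an explicit assumption: the finest-scale source feature map is average-pooled down to each target scale's resolution. The function names (`avg_pool_to`, `scale_aligned_references`) and the scalar 4×4 feature map are hypothetical and are not taken from the VAREdit code base.

```python
def avg_pool_to(feature_map, size):
    """Average-pool a square 2D feature map (list of lists) to size x size.

    Assumption: the input side length is a multiple of `size`.
    """
    n = len(feature_map)
    if n % size != 0:
        raise ValueError("input side must be a multiple of the target size")
    block = n // size
    pooled = []
    for i in range(size):
        row = []
        for j in range(size):
            total = 0.0
            # Sum the block of finest-scale features covered by cell (i, j).
            for di in range(block):
                for dj in range(block):
                    total += feature_map[i * block + di][j * block + dj]
            row.append(total / (block * block))
        pooled.append(row)
    return pooled


def scale_aligned_references(source_finest, scales):
    """Build one source reference per target scale, matched in resolution.

    Hypothetical helper: pooling is an assumed stand-in for whatever
    downsampling the actual SAR module uses.
    """
    return {s: avg_pool_to(source_finest, s) for s in scales}


# Toy 4x4 "finest-scale" source feature map with scalar features.
source = [[float(4 * r + c) for c in range(4)] for r in range(4)]
refs = scale_aligned_references(source, scales=[1, 2, 4])
```

In this sketch, a coarse target scale (for example 2×2) would be conditioned on `refs[2]`, a source reference of the same resolution, instead of on the full 4×4 map. That mirrors the observation in the abstract that finest-scale source features are a poor guide for predicting coarser target features.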
