Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Abstract

In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, which demands precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
