Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows a UMM to refine its output after the initial generation, potentially raising the upper bound on performance. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, in which the UMM produces editing instructions that modify misaligned regions while preserving aligned content. However, editing instructions often describe the prompt-image misalignment only coarsely, which leads to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates the image conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment within a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, which improves Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
