RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

Abstract

Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty, while reflection-tuned models require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models, often overfitting to reflection-path data and transferring poorly across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall on GenEval) while using fewer generated samples (reduced by 30-40%) and fewer VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
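
The requirement-driven loop described in the abstract can be pictured with a toy sketch. This is not the authors' implementation (the linked repository is the reference for that). The names `Candidate`, `verify`, `refine`, and `evolve` are hypothetical, an image is replaced by the set of requirement strings it satisfies, and the VLM verifier and the refinement actions are replaced by simple random stand-ins. The only point illustrated is the control flow: verify against a checklist, stop as soon as nothing is unmet, and otherwise spend more samples only on the unmet items.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """Toy stand-in for a generated image: the requirements it satisfies."""
    satisfied: FrozenSet[str]


def verify(candidate: Candidate, checklist: List[str]) -> List[str]:
    """Return the checklist items the candidate fails (stand-in for a VLM check)."""
    return [req for req in checklist if req not in candidate.satisfied]


def refine(candidate: Candidate, unmet: List[str], rng: random.Random) -> Candidate:
    """Hypothetical refinement action: try to fix one unmet item, succeeding half the time."""
    target = rng.choice(unmet)
    if rng.random() < 0.5:
        return Candidate(candidate.satisfied | {target})
    return candidate


def evolve(
    checklist: List[str],
    population_size: int = 4,
    max_rounds: int = 10,
    seed: int = 0,
) -> Tuple[Candidate, int, int]:
    """Evolve candidates until the checklist is met or the round budget runs out.

    Returns (best candidate, refinement rounds used, total samples generated).
    """
    rng = random.Random(seed)
    # Initial population: each requirement is satisfied at random (assumption).
    population = [
        Candidate(frozenset(r for r in checklist if rng.random() < 0.5))
        for _ in range(population_size)
    ]
    samples = population_size

    for round_idx in range(max_rounds):
        best = min(population, key=lambda c: len(verify(c, checklist)))
        unmet = verify(best, checklist)
        if not unmet:
            # Adaptive stop: easy prompts exit early and use fewer samples.
            return best, round_idx, samples
        # Keep the best candidate and refine it toward the unmet items only.
        children = [refine(best, unmet, rng) for _ in range(population_size - 1)]
        samples += len(children)
        population = [best] + children

    best = min(population, key=lambda c: len(verify(c, checklist)))
    return best, max_rounds, samples
```

Under these assumptions, a call such as `evolve(["two cats", "red ball", "cat left of ball"])` returns early once some candidate has no unmet items, so the number of samples grows with how many requirements remain unsatisfied rather than with a fixed iteration budget.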
