RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment

Abstract

Recent text-to-image (T2I) diffusion models achieve remarkable realism, yet faithful prompt-image alignment remains challenging, particularly for complex prompts with multiple objects, relations, and fine-grained attributes. Existing training-free inference-time scaling methods rely on fixed iteration budgets that cannot adapt to prompt difficulty. Reflection-tuned models, in turn, require carefully curated reflection datasets and extensive joint fine-tuning of diffusion and vision-language models; they often overfit to reflection-path data and transfer poorly across models. We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. RAISE formulates image generation as a requirement-driven adaptive scaling process, evolving a population of candidates at inference time through a diverse set of refinement actions, including prompt rewriting, noise resampling, and instructional editing. Each generation is verified against a structured checklist of requirements, enabling the system to dynamically identify unsatisfied items and allocate further computation only where needed. This achieves adaptive test-time scaling that aligns computational effort with semantic query complexity. On GenEval and DrawBench, RAISE attains state-of-the-art alignment (0.94 overall on GenEval) while using fewer generated samples (reduced by 30-40%) and fewer VLM calls (reduced by 80%) than prior scaling and reflection-tuned baselines, demonstrating efficient, generalizable, and model-agnostic multi-round self-improvement. Code is available at https://github.com/LiyaoJiang1998/RAISE.
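The requirement-driven loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`verify`, `refine`, `adaptive_evolve`), the representation of a candidate image as the set of checklist items it satisfies, the 0.5 repair probability, and the keep-the-best selection rule are all assumptions made for illustration. In a real system, verification would be done by a VLM and refinement by a diffusion or editing model.

```python
import random
from typing import FrozenSet, List, Tuple

# Action names are taken from the abstract; their behavior here is a toy placeholder.
ACTIONS = ("rewrite_prompt", "resample_noise", "instruction_edit")


def verify(candidate: FrozenSet[str], checklist: List[str]) -> List[str]:
    """Toy verifier (stand-in for a VLM): return the unsatisfied checklist items."""
    return [req for req in checklist if req not in candidate]


def refine(candidate: FrozenSet[str], unmet: List[str], action: str,
           rng: random.Random) -> FrozenSet[str]:
    """Toy refinement: each unmet item is repaired with probability 0.5.

    The action name is carried only for illustration and does not change
    the outcome in this sketch.
    """
    fixed = {req for req in unmet if rng.random() < 0.5}
    return candidate | fixed


def adaptive_evolve(checklist: List[str], population_size: int = 4,
                    max_rounds: int = 10,
                    seed: int = 0) -> Tuple[FrozenSet[str], int, int]:
    """Evolve candidates until the checklist is satisfied or the budget runs out.

    Returns (best candidate, refinement rounds used, total samples generated).
    """
    rng = random.Random(seed)
    # Initial generation: candidates that satisfy nothing yet (hypothetical start).
    population = [frozenset() for _ in range(population_size)]
    samples = population_size
    for round_idx in range(max_rounds):
        # Keep the candidate with the fewest unmet requirements.
        best = min(population, key=lambda c: len(verify(c, checklist)))
        unmet = verify(best, checklist)
        if not unmet:
            # Early stop: simple prompts consume fewer rounds and samples.
            return best, round_idx, samples
        # Spend further computation only on the items that are still unmet.
        population = [refine(best, unmet, rng.choice(ACTIONS), rng)
                      for _ in range(population_size)]
        samples += population_size
    best = min(population, key=lambda c: len(verify(c, checklist)))
    return best, max_rounds, samples


if __name__ == "__main__":
    simple = ["a cat"]
    complex_prompt = ["a cat", "a red ball", "cat left of ball", "two windows"]
    for checklist in (simple, complex_prompt):
        best, rounds, samples = adaptive_evolve(checklist)
        print(len(checklist), "requirements:", rounds, "rounds,", samples, "samples")
```

The point of the sketch is the stopping rule: the loop ends as soon as the checklist is fully satisfied, so the number of generated samples depends on how many requirements a prompt carries rather than on a fixed iteration budget.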
