AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Abstract

We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Prior methods perform a single, static retrieval before generation and condition the entire generation on fixed reference images. AR-RAG instead performs context-aware retrieval at each generation step, using previously generated patches as queries to retrieve and incorporate the most relevant patch-level visual references. This lets the model respond to evolving generation needs while avoiding limitations prevalent in existing methods, such as over-copying and stylistic bias. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval, and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.
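The abstract does not give the exact merging rule used by DAiD, so the following is only a minimal sketch of one plausible reading: a linear interpolation between the model's next-patch distribution and a distribution built from the k nearest retrieved patches. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the squared-Euclidean distance, the softmax over negative distances, the `temperature` parameter, and the mixing weight `lam`. The toy database stores hypothetical (context feature, patch token id) pairs.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def retrieval_distribution(query, database, vocab_size, k=2, temperature=1.0):
    # database: list of (key_vector, patch_token_id) pairs (hypothetical layout).
    # The query stands in for features of previously generated patches.
    neighbors = sorted(database, key=lambda item: sq_dist(query, item[0]))[:k]
    # Closer neighbors get more weight (assumed weighting scheme).
    weights = softmax([-sq_dist(query, key) / temperature for key, _ in neighbors])
    dist = [0.0] * vocab_size
    for w, (_, token) in zip(weights, neighbors):
        dist[token] += w
    return dist

def merge_distributions(model_probs, retrieved_probs, lam=0.3):
    # Assumed merging rule: convex combination of the two distributions.
    return [(1 - lam) * p + lam * q for p, q in zip(model_probs, retrieved_probs)]

# Toy example: a 3-token patch vocabulary and a 3-entry retrieval database.
database = [([0.0, 0.0], 1), ([0.1, 0.0], 1), ([5.0, 5.0], 2)]
model_probs = [0.5, 0.3, 0.2]          # model's prediction for the next patch
retrieved = retrieval_distribution([0.0, 0.1], database, vocab_size=3, k=2)
merged = merge_distributions(model_probs, retrieved, lam=0.5)
next_patch = merged.index(max(merged))  # greedy choice for this generation step
```

In an autoregressive loop, this lookup and merge would be repeated at every generation step with a fresh query built from the patches generated so far, which is what distinguishes the approach from a single retrieval performed before generation.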
