AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
June 8, 2025
Authors: Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang
cs.AI
Abstract
We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm
that enhances image generation by autoregressively incorporating k-nearest
neighbor retrievals at the patch level. Unlike prior methods that perform a
single, static retrieval before generation and condition the entire generation
on fixed reference images, AR-RAG performs context-aware retrievals at each
generation step, using prior-generated patches as queries to retrieve and
incorporate the most relevant patch-level visual references, enabling the model
to respond to evolving generation needs while avoiding limitations (e.g.,
over-copying, stylistic bias, etc.) prevalent in existing methods. To realize
AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in
Decoding (DAiD), a training-free plug-and-play decoding strategy that directly
merges the distribution of model-predicted patches with the distribution of
retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a
parameter-efficient fine-tuning method that progressively smooths the features
of retrieved patches via multi-scale convolution operations and leverages them
to augment the image generation process. We validate the effectiveness of
AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval, and
DPG-Bench, demonstrating significant performance gains over state-of-the-art
image generation models.
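The DAiD strategy described above can be illustrated with a minimal sketch. The abstract only states that the model-predicted patch distribution is merged with a distribution derived from retrieved patches; the concrete form below, including the mixing weight `alpha`, the frequency-based retrieval distribution, and the function name `daid_merge`, are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def daid_merge(model_probs, retrieved_patch_ids, alpha=0.5):
    """Sketch of a DAiD-style merge (assumed form): blend the model's
    predicted distribution over patch tokens with an empirical
    distribution built from the ids of retrieved nearest-neighbor
    patches at the current decoding step."""
    vocab_size = len(model_probs)
    # Empirical distribution: normalized frequency of retrieved patch ids.
    retrieval_probs = np.zeros(vocab_size)
    for pid in retrieved_patch_ids:
        retrieval_probs[pid] += 1.0
    retrieval_probs /= retrieval_probs.sum()
    # Convex combination of the two distributions (alpha is assumed).
    merged = (1.0 - alpha) * np.asarray(model_probs) + alpha * retrieval_probs
    return merged / merged.sum()

# Example: a uniform model distribution nudged toward retrieved patch 0.
merged = daid_merge([0.25, 0.25, 0.25, 0.25], retrieved_patch_ids=[0, 0, 1])
```

Because the merge happens at every decoding step with freshly retrieved neighbors, the retrieval signal can track the evolving generation context rather than being fixed before generation begins.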