Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
May 24, 2025
Authors: Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
cs.AI
Abstract
Modern single-image super-resolution (SISR) models deliver photo-realistic
results at the scale factors on which they are trained, but collapse when asked
to magnify far beyond that regime. We address this scalability bottleneck with
Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an
autoregressive chain of intermediate scale-states with multi-scale-aware
prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the
conditional probability into tractable sub-problems to achieve extreme
resolutions without additional training. Because visual cues diminish at high
magnifications, we augment each zoom step with multi-scale-aware text prompts
generated by a vision-language model (VLM). The prompt extractor itself is
fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic
VLM, aligning text guidance towards human preference. Experiments show that a
standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement
with high perceptual quality and fidelity. Project Page:
https://bryanswkim.github.io/chain-of-zoom/.
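The abstract's core idea, reusing a fixed-scale SR backbone autoregressively with a fresh VLM prompt at each intermediate scale-state, can be sketched as a simple loop. This is a hypothetical illustration, not the authors' code: the image is stood in for by its side length, and `sr_step` / `prompt_extractor` are toy stand-ins for the 4x diffusion SR model and the prompt-extraction VLM.

```python
# Toy sketch of the Chain-of-Zoom (CoZ) recursion described in the abstract.
# All names here are illustrative assumptions, not the authors' actual API.

def chain_of_zoom(image, sr_step, prompt_extractor, num_steps):
    """Factorize extreme SR into an autoregressive chain of scale-states.

    At every step a multi-scale-aware prompt is extracted from the
    current intermediate result and fed to the same backbone SR model,
    so no retraining is needed to exceed the trained 4x scale factor.
    """
    current = image
    for _ in range(num_steps):
        prompt = prompt_extractor(current)   # VLM text guidance per zoom step
        current = sr_step(current, prompt)   # fixed 4x SR backbone, reused
    return current

# Stand-ins: a "4x" upscaler acting on side length, and a trivial prompt.
sr_step = lambda img, prompt: img * 4
prompt_extractor = lambda img: f"detailed view at side {img}px"

print(chain_of_zoom(64, sr_step, prompt_extractor, num_steps=4))
# 64px -> 16384px: four chained 4x steps give an effective 256x zoom
```

Chaining four 4x applications is how a standard 4x model reaches the beyond-256x enlargements reported, with each sub-problem staying inside the regime the backbone was trained on.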