Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
May 24, 2025
Authors: Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
cs.AI
Abstract
Modern single-image super-resolution (SISR) models deliver photo-realistic
results at the scale factors on which they are trained, but collapse when asked
to magnify far beyond that regime. We address this scalability bottleneck with
Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an
autoregressive chain of intermediate scale-states with multi-scale-aware
prompts. CoZ repeatedly reuses a backbone SR model, decomposing the
conditional probability into tractable sub-problems to achieve extreme
resolutions without additional training. Because visual cues diminish at high
magnifications, we augment each zoom step with multi-scale-aware text prompts
generated by a vision-language model (VLM). The prompt extractor itself is
fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic
VLM, aligning text guidance towards human preference. Experiments show that a
standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement
with high perceptual quality and fidelity. Project Page:
https://bryanswkim.github.io/chain-of-zoom/
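To make the autoregressive factorization concrete, the sketch below shows the chaining loop the abstract describes: a fixed 4x SR backbone applied four times (4^4 = 256x), with each step conditioned on a fresh multi-scale-aware prompt from a VLM. The names `sr_backbone` and `prompt_extractor` are hypothetical placeholders, not the authors' actual API; this is a minimal illustration of the idea, not the released implementation.

```python
def chain_of_zoom(image, sr_backbone, prompt_extractor, num_steps=4):
    """Illustrative Chain-of-Zoom loop (hypothetical interfaces).

    sr_backbone(image, prompt) -> image upscaled by a fixed factor (e.g. 4x)
    prompt_extractor(image)    -> multi-scale-aware text prompt from a VLM
    With a 4x backbone, num_steps=4 yields a 256x total magnification.
    """
    for _ in range(num_steps):
        # Visual cues diminish at high magnification, so each zoom step
        # is conditioned on a text prompt generated from the current
        # intermediate scale-state.
        prompt = prompt_extractor(image)
        # One step of the autoregressive chain: the same backbone SR
        # model is reused without any additional training.
        image = sr_backbone(image, prompt)
    return image
```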