Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Abstract

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ achieves enlargement beyond 256x with high perceptual quality and fidelity. Project page: https://bryanswkim.github.io/chain-of-zoom/
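
The zoom chain described in the abstract can be illustrated with a minimal, hedged sketch. It is not the authors' implementation. The names `chain_of_zoom`, `center_crop`, `nearest_upscale`, and `describe` are hypothetical. The backbone is replaced by nearest-neighbour upscaling, the VLM prompt extractor by a stub that returns a fixed string, and the centre crop before each step is an assumption made here only to keep the image size constant. The sketch shows just the autoregressive structure: each step conditions on the earlier scale-states and a text prompt, and four 4x steps compound to 256x.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # toy grayscale image: rows of pixel values


def center_crop(img: Image, factor: int) -> Image:
    """Keep the central 1/factor of the image along each axis (assumed zoom region)."""
    h, w = len(img), len(img[0])
    ch, cw = h // factor, w // factor
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in img[top:top + ch]]


def nearest_upscale(img: Image, prompt: str, factor: int) -> Image:
    """Placeholder backbone: nearest-neighbour upscaling that ignores the prompt."""
    return [[px for px in row for _ in range(factor)]
            for row in img for _ in range(factor)]


def describe(history: Sequence[Image]) -> str:
    """Placeholder prompt extractor; a real one would be a VLM reading the history."""
    return f"view after {len(history) - 1} zoom steps"


def chain_of_zoom(
    img: Image,
    sr_step: Callable[[Image, str, int], Image],
    prompt_fn: Callable[[Sequence[Image]], str],
    steps: int,
    factor: int = 4,
) -> Tuple[List[Image], List[str], int]:
    """Reuse one fixed-scale SR step repeatedly, conditioning each step on a prompt."""
    states: List[Image] = [img]
    prompts: List[str] = []
    for _ in range(steps):
        prompt = prompt_fn(states)               # text guidance from earlier scale-states
        patch = center_crop(states[-1], factor)  # region to magnify next
        states.append(sr_step(patch, prompt, factor))
        prompts.append(prompt)
    return states, prompts, factor ** steps      # total magnification compounds


if __name__ == "__main__":
    toy = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
    states, prompts, total = chain_of_zoom(toy, nearest_upscale, describe, steps=4)
    print(f"{len(states) - 1} steps, total magnification {total}x")
```

In a real system, `sr_step` would wrap the pretrained 4x diffusion SR model and `prompt_fn` the fine-tuned VLM; neither needs retraining for the loop itself to run.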
