Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Abstract

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly reuses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning the text guidance with human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ reaches magnifications beyond 256x while maintaining high perceptual quality and fidelity. Project page: https://bryanswkim.github.io/chain-of-zoom/
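The chaining idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The names `num_zoom_steps`, `toy_backbone_sr`, `toy_prompt_extractor`, and `chain_of_zoom` are hypothetical, the "backbone" is plain nearest-neighbour upsampling, and the "prompt extractor" is a stub that only reports the sizes of the scale-states seen so far. The sketch shows only the control flow: a fixed 4x model is applied repeatedly, and each step receives a prompt derived from the earlier scale-states.

```python
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image as a nested list of pixel values


def num_zoom_steps(target_scale: int, step_scale: int = 4) -> int:
    """Smallest n such that step_scale**n >= target_scale."""
    steps, scale = 0, 1
    while scale < target_scale:
        scale *= step_scale
        steps += 1
    return steps


def toy_backbone_sr(image: Image, prompt: str, scale: int = 4) -> Image:
    """Hypothetical stand-in for a pretrained 4x SR model.

    Nearest-neighbour upsampling that ignores the prompt; a real backbone
    would condition its output on the text.
    """
    return [
        [pixel for pixel in row for _ in range(scale)]
        for row in image
        for _ in range(scale)
    ]


def toy_prompt_extractor(history: List[Image]) -> str:
    """Hypothetical stand-in for the VLM prompt extractor.

    It receives every scale-state produced so far, mimicking the
    multi-scale-aware conditioning described in the abstract.
    """
    sizes = ", ".join(f"{len(img)}x{len(img[0])}" for img in history)
    return f"scale-states seen so far: {sizes}"


def chain_of_zoom(
    image: Image,
    target_scale: int,
    backbone: Callable[[Image, str], Image],
    extractor: Callable[[List[Image]], str],
    step_scale: int = 4,
) -> List[Image]:
    """Apply the same backbone autoregressively until target_scale is reached."""
    states = [image]
    for _ in range(num_zoom_steps(target_scale, step_scale)):
        prompt = extractor(states)                   # text guidance for this step
        states.append(backbone(states[-1], prompt))  # next intermediate scale-state
    return states


if __name__ == "__main__":
    states = chain_of_zoom(
        [[0, 255], [255, 0]], 16, toy_backbone_sr, toy_prompt_extractor
    )
    print([len(s) for s in states])  # [2, 8, 32]
    print(num_zoom_steps(256))       # 4
```

Under these assumptions, a 4x backbone needs four chained steps to reach 256x, since 4^4 = 256. The sketch leaves out the diffusion backbone, any cropping between steps, and the GRPO fine-tuning of the extractor.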
