GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Abstract

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on either kind of dataset generalize poorly to real-world scenarios. In this work, we propose Generative Ground Truth (GGT), which uses generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first systematically evaluate nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images covering various scenes and degradation types. The results show that Nano-Banana-2 with adaptive prompting based on a vision-language model (VLM) is the most capable at synthesizing perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then use Nano-Banana-2 to build a GGT synthesis pipeline with multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset of 103,707 training pairs covering diverse scenes and complex real-world degradations. We also establish a test set of 500 image pairs. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits when finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and that GGT-100K is a useful resource for expanding the generalization boundaries of real-world IR models.