GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Abstract

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and laborious to collect. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT), which uses generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first systematically evaluate nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images spanning various scenes and degradation types. The results show that Nano-Banana-2 with adaptive prompting based on a vision-language model (VLM) is the most capable of synthesizing perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ inputs. We then build a GGT synthesis pipeline around Nano-Banana-2, with multi-stage quality control to ensure data reliability, and use it to construct GGT-100K, an LQ-HQ paired dataset of 103,707 training pairs covering diverse scenes and complex real-world degradations. We also establish a test set of 500 image pairs. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits when finetuning generative models for IR tasks. These results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and that GGT-100K is a useful resource for extending the generalization limits of real-world IR models.