PixVerve: Enabling Native UHR Image Generation at 100MP with a Large-Scale, High-Quality Dataset

Abstract

Text-to-Image (T2I) models have recently made notable progress at 1K and 2K resolutions. With the strong desire for a better visual experience and the rapid development of imaging technology, demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains highly challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline. It contains 95K images across diverse scenarios, each with a minimum pixel count of 100M, together with seven-dimensional annotations. Building on this large-scale image-text dataset, we take a pioneering step toward extending various T2I foundation models to native 100MP generation using three training schemes. Finally, we propose PixVerve-Bench, a benchmark that combines conventional metrics with multimodal large language model-based assessments to establish a comprehensive evaluation protocol for UHR images, covering both visual quality and semantic alignment. Extensive experimental results on our benchmark, together with a constructive exploration of training strategies, provide valuable insights for future breakthroughs.