PixVerve: Advancing Native Ultra-High-Resolution Image Generation to 100 Megapixels with a Large-Scale, High-Quality Dataset

Abstract

Text-to-Image (T2I) models have recently made notable progress at 1K and 2K resolutions. Driven by the pursuit of better visual experiences and by rapid advances in imaging technology, demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains highly challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline. It contains 95K images across diverse scenarios, each with at least 100M pixels, together with seven-dimensional annotations. Building on this large-scale image-text dataset, we take a pioneering step by extending various T2I foundation models to native 100MP generation using three training schemes. Finally, we propose PixVerve-Bench, a benchmark that combines conventional metrics with multimodal large language model (MLLM)-based assessments to establish a comprehensive evaluation protocol for UHR images, covering visual quality and semantic alignment. Extensive experiments on our benchmark, together with our exploration of training strategies, provide valuable insights for future breakthroughs.