
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

November 22, 2025
Authors: Tian Ye, Song Fei, Lei Zhu
cs.AI

Abstract

Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios (ARs) exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based diffusion transformer (DiT) trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and in multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
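
For readers unfamiliar with the positional-encoding component, the following is a minimal sketch of plain axial 2D RoPE, the standard baseline that 2D rotary schemes build on. It is not the paper's Resonance 2D RoPE and it omits YaRN; the function names `rope_1d` and `rope_2d`, the channel split (first half for rows, second half for columns), and the frequency base of 10000 are assumptions chosen for illustration only.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles pos * theta_k.

    x: 1D float array with an even number of channels.
    pos: scalar position along one axis.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "channel count must be even"
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per channel pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_2d(x, row, col, base=10000.0):
    """Axial 2D RoPE (illustrative): the first half of the channels
    encodes the row index, the second half encodes the column index."""
    d = x.shape[-1]
    assert d % 4 == 0, "channel count must be divisible by 4"
    half = d // 2
    return np.concatenate(
        [rope_1d(x[..., :half], row, base), rope_1d(x[..., half:], col, base)],
        axis=-1,
    )

# Usage: attention scores depend only on the relative (row, col) offset.
rng = np.random.default_rng(0)
q = rng.standard_normal(16)
k = rng.standard_normal(16)
score_a = rope_2d(q, 3, 5) @ rope_2d(k, 1, 2)      # offset (2, 3)
score_b = rope_2d(q, 13, 25) @ rope_2d(k, 11, 22)  # same offset, shifted
```

Because each channel pair is rotated by an angle proportional to position, the two scores above agree up to floating-point error. The rotation angles grow with absolute position, so token grids much larger than those seen in training produce angles outside the training range; this is the general kind of extrapolation issue that frequency-rescaling methods such as YaRN target.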