
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

October 14, 2024
Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
cs.AI

Abstract

We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR) visual generation model capable of directly generating 1024x1024 images, rivaling diffusion models in image generation quality. Existing AR models face limitations due to the poor image reconstruction quality of their discrete tokenizers and the prohibitive training costs associated with generating 1024px images. To address these challenges, we present the hybrid tokenizer, which decomposes the continuous latents from the autoencoder into two components: discrete tokens representing the big picture and continuous tokens representing the residual components that cannot be represented by the discrete tokens. The discrete component is modeled by a scalable-resolution discrete AR model, while the continuous component is learned with a lightweight residual diffusion module with only 37M parameters. Compared with the discrete-only VAR tokenizer, our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K, leading to a 31% generation FID improvement from 7.85 to 5.38. HART also outperforms state-of-the-art diffusion models in both FID and CLIP score, with 4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open-sourced at https://github.com/mit-han-lab/hart.
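The core idea of the hybrid tokenizer can be sketched as follows. This is a toy illustration, not HART's actual code: the shapes, codebook size, and function names are assumptions, and a single nearest-neighbor vector quantizer stands in for the paper's multi-scale discrete tokenizer. The point is that the discrete tokens and the continuous residual exactly recompose the original continuous latents.

```python
import numpy as np

def quantize(latent, codebook):
    """Map each continuous latent vector to its nearest codebook entry
    (the discrete tokens capturing the 'big picture').
    latent: (N, D) array, codebook: (K, D) array."""
    dists = ((latent[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (N,) discrete token indices

def hybrid_tokenize(latent, codebook):
    """Decompose continuous latents into discrete tokens plus a
    continuous residual (the part the discrete tokens cannot express)."""
    tokens = quantize(latent, codebook)
    dequantized = codebook[tokens]      # reconstruction from discrete tokens
    residual = latent - dequantized     # continuous residual tokens
    return tokens, residual

rng = np.random.default_rng(0)
latent = rng.normal(size=(16, 8))       # toy continuous autoencoder latents
codebook = rng.normal(size=(4, 8))      # toy codebook (HART's is far larger)

tokens, residual = hybrid_tokenize(latent, codebook)
# By construction, discrete + residual recompose the latents exactly:
assert np.allclose(codebook[tokens] + residual, latent)
```

In HART, the `tokens` part is generated by the scalable-resolution discrete AR model, while the `residual` part is sampled by the 37M-parameter residual diffusion module; the sketch above only shows the lossless decomposition that makes that split possible.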
