HunyuanImage 3.0 Technical Report

Abstract

We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. HunyuanImage 3.0 relies on several key components: meticulous data curation, an advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments, and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.
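
To make the gap between total and activated parameters concrete, the following is a minimal sketch of top-k expert routing, the general mechanism by which a Mixture-of-Experts layer uses only part of its weights for each token. It is not the released implementation: the expert count, the value of k, the gate logits, and the function names `top_k_route` and `active_fraction` are all hypothetical and chosen only for illustration. The only figures taken from the abstract are the 13 billion activated and 80 billion total parameters.

```python
import math
import random


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Rank experts by gate logit and keep the k best.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(active_params, total_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params


if __name__ == "__main__":
    # Hypothetical toy configuration, NOT the released model's settings.
    num_experts, k = 8, 2
    random.seed(0)
    logits = [random.gauss(0.0, 1.0) for _ in range(num_experts)]

    for expert, weight in top_k_route(logits, k):
        print(f"token -> expert {expert} with weight {weight:.3f}")

    # Figures from the abstract: 13B activated out of (over) 80B total.
    print(f"active share: {active_fraction(13e9, 80e9):.2%}")
```

Because the abstract states "over 80 billion" total parameters, the computed share is an upper bound: roughly one sixth of the weights, or slightly less, participate in processing any given token.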
