CoRe^2: Collect, Reflect and Refine to Generate Better and Faster

Abstract

Making text-to-image (T2I) generative models sample both quickly and well is a promising research direction. Previous studies have typically either improved the visual quality of synthesized images at the expense of sampling efficiency, or dramatically accelerated sampling without improving the base model's generative capacity. Moreover, almost no inference method has delivered stable performance on both diffusion models (DMs) and visual autoregressive models (ARMs). In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then uses the collected data to train a weak model that reflects the easy-to-learn content while halving the number of function evaluations during inference. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It exhibits significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES respectively, while saving 5.64 s with SD3.5. Code is released at https://github.com/xie-lab-ml/CoRe/tree/main.
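
To make the function-evaluation argument concrete, the following is a minimal toy sketch, not the released implementation. It contrasts a standard CFG step, which calls the base model twice, with a refine step that calls the base model once and obtains its second signal from a cheap weak model. The names `toy_base`, `toy_weak`, `weak_to_strong`, and `CountingModel`, as well as the specific extrapolation rule used for weak-to-strong guidance, are assumptions introduced here for illustration; the abstract does not specify the exact formula.

```python
from typing import Callable, List

Vector = List[float]


def cfg_combine(cond: Vector, uncond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def weak_to_strong(strong: Vector, weak: Vector, scale: float) -> Vector:
    """Assumed guidance rule: push the strong output away from the weak one."""
    return [s + scale * (s - w) for s, w in zip(strong, weak)]


class CountingModel:
    """Wraps a denoiser and counts how often it is evaluated."""

    def __init__(self, fn: Callable[[Vector, bool], Vector]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, x: Vector, conditioned: bool) -> Vector:
        self.calls += 1
        return self.fn(x, conditioned)


def toy_base(x: Vector, conditioned: bool) -> Vector:
    """Hypothetical stand-in for an expensive base model."""
    shift = 1.0 if conditioned else 0.0
    return [0.5 * v + shift for v in x]


def toy_weak(cond_out: Vector) -> Vector:
    """Hypothetical stand-in for the cheap weak model trained on CFG data."""
    return [0.9 * v for v in cond_out]


def cfg_step(model: CountingModel, x: Vector, scale: float) -> Vector:
    # Two base-model evaluations: conditional and unconditional.
    cond = model(x, True)
    uncond = model(x, False)
    return cfg_combine(cond, uncond, scale)


def refine_step(model: CountingModel, x: Vector, scale: float) -> Vector:
    # One base-model evaluation; the weak model supplies the second signal.
    cond = model(x, True)
    return weak_to_strong(cond, toy_weak(cond), scale)


cfg_model = CountingModel(toy_base)
cfg_out = cfg_step(cfg_model, [2.0, 4.0], scale=3.0)

refine_model = CountingModel(toy_base)
refine_out = refine_step(refine_model, [2.0, 4.0], scale=0.5)

print(cfg_model.calls, refine_model.calls)
```

In this toy setting the CFG step evaluates the base model twice per step and the refine step evaluates it once, which is the sense in which the number of function evaluations is halved, provided the weak model is cheap relative to the base model.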
