Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Abstract

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-to-pixel mappers: they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, in which the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the hidden states of the rewritten prompts then serve as the diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and the diffusion backbone are co-optimized via Dual-GRPO to ensure faithful reasoning about the context and accurate rendering of the semantics. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving a WISE score of 0.79, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
