GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Abstract

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, evaluating with TED-6K is about 750 times faster than training a diffusion model from scratch. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
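
The abstract does not spell out how the layer-wise weighting works, so the following is only a minimal sketch of one common way such a scheme can be set up, not the authors' method. It assumes one learnable scalar per encoder layer, normalized with a softmax, and a weighted sum of per-layer token features. The names `softmax`, `combine_layers`, `layer_features`, and `layer_logits` are hypothetical and chosen for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(
    layer_features: Sequence[Sequence[Vector]],
    layer_logits: Sequence[float],
) -> List[Vector]:
    """Weighted sum of per-layer token features.

    layer_features[l][t] is the feature vector of token t at layer l.
    layer_logits[l] is the (assumed learnable) scalar score of layer l.
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")

    weights = softmax(layer_logits)
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])

    # Accumulate w_l * h_l[t] into one output vector per token.
    combined = [[0.0] * dim for _ in range(num_tokens)]
    for weight, layer in zip(weights, layer_features):
        for t, vec in enumerate(layer):
            for d, value in enumerate(vec):
                combined[t][d] += weight * value
    return combined


# Toy input: two layers, one token, two feature dimensions.
features = [
    [[1.0, 0.0]],  # layer 0
    [[0.0, 1.0]],  # layer 1
]
mixed = combine_layers(features, layer_logits=[0.0, 0.0])
```

With equal logits the two layers contribute equally, so the single token's combined vector is the average of its two per-layer vectors. In a trained setup the logits would be optimized jointly with whatever consumes the embeddings, which is again an assumption here.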
