
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

December 17, 2025
Authors: Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang
cs.AI

Abstract

The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, evaluating with TED-6K is about 750× faster than training a diffusion model from scratch. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
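The abstract does not spell out the layer-wise weighting formula. A common form of such a method is to learn one scalar per transformer layer, softmax-normalize the scalars, and take the weighted sum of the per-layer hidden states as the final text embedding. The sketch below illustrates that form in plain Python; all names and the fixed weights are hypothetical, not the paper's actual implementation.

```python
import math

def softmax(weights):
    """Numerically stable softmax over a list of layer scalars."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def layerwise_embedding(hidden_states, layer_weights):
    """Combine per-layer hidden states into one text embedding.

    hidden_states: list of per-layer embedding vectors (one per layer),
                   e.g. the hidden state at a pooled token position.
    layer_weights: one learnable scalar per layer (fixed here for the sketch).
    Returns the softmax-weighted sum of the layer vectors.
    """
    alphas = softmax(layer_weights)
    dim = len(hidden_states[0])
    out = [0.0] * dim
    for alpha, h in zip(alphas, hidden_states):
        for i, v in enumerate(h):
            out[i] += alpha * v
    return out

# Toy example: 3 layers, 2-dimensional embeddings.
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Equal weights reduce to a simple mean over layers: [2/3, 2/3].
emb = layerwise_embedding(hs, [0.0, 0.0, 0.0])
```

In practice the layer scalars would be trained jointly with the downstream objective, so later layers can dominate when they carry more visually grounded semantics; with zero (equal) weights the method degenerates to mean pooling over layers.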