
Directional Textual Inversion for Personalized Text-to-Image Generation

December 15, 2025
作者: Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim
cs.AI

Abstract

Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show that semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only the direction on the unit hypersphere via Riemannian SGD. We cast direction learning as maximum a posteriori (MAP) estimation with a von Mises–Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent spherical linear interpolation (slerp) between learned concepts, a capability absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path to prompt-faithful personalization.
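The two geometric ingredients named in the abstract — a Riemannian SGD step constrained to the unit hypersphere and slerp between learned directions — can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation: the learning rate, gradient, and the fixed magnitude `rho` are placeholder assumptions, and the final token embedding is assumed to be `rho * v` for a unit direction `v`.

```python
import numpy as np

def riemannian_sgd_step(v, grad, lr):
    """One Riemannian SGD step on the unit hypersphere.

    Projects the Euclidean gradient onto the tangent space at v
    (removing the radial component), takes a step, then retracts
    back to the sphere by renormalizing. A vMF MAP prior would add
    a constant pull toward a prior mean direction before this step.
    """
    tangent = grad - np.dot(grad, v) * v      # tangent-space projection
    v_new = v - lr * tangent                  # Euclidean step in tangent space
    return v_new / np.linalg.norm(v_new)      # retraction onto the sphere

def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v."""
    cos_theta = np.clip(np.dot(u, v), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-6:                          # nearly parallel: fall back to lerp
        w = (1.0 - t) * u + t * v
        return w / np.linalg.norm(w)
    return (np.sin((1.0 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)
```

Because every update ends with a retraction, the direction stays exactly unit-norm, so the conditioning embedding `rho * v` never drifts to the out-of-distribution magnitudes that the paper identifies as the failure mode of standard TI.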